Due to a fire at SK C&C's Pangyo data center (IDC), Kakao-related services have all stopped working. (Related News)

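South Korea could practically be called the Republic of Kakao... KakaoTalk, KakaoPay, KakaoT, Upbit, and so on: not only Kakao's own services but also third-party services that rely on its authentication have come to a complete halt.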

I only have basic knowledge of servers, but I couldn't understand how a giant company like Kakao, which operates countless linked services, could have all of them go down because of a fire at a single IDC, or why no recovery plan has been announced yet. So I posted a question on Clien (https://www.clien.net/), where many people have relevant technical experience or knowledge, and I've been reading what others are saying...

I don't understand any of the terms flying around.

So, to help understand the situation, I will organize some of the more notable English terms that have been frequently mentioned.
The source is Wikipedia.

High Availability (HA) (Source)

  • It refers to the property of an information system, such as a server, network, or program, of continuing to operate normally for a long period of time. High Availability (HA) literally means that availability is high; the ideal it aims at is a system that never goes down, although in practice it means keeping downtime as small as possible.
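To get a feel for what "high" means here, availability is commonly expressed as the fraction of time a system is up. The sketch below is only my own arithmetic illustration, not anything from Kakao or SK C&C; the function names are made up, and the redundancy formula assumes the components fail independently of each other.

```python
# Hypothetical helper names; the numbers are plain arithmetic, not real outage data.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget (in minutes) for an availability ratio."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(*components: float) -> float:
    """Availability of redundant components when any single one is enough.

    Assumption: failures are independent. Two servers in the same building
    do NOT satisfy this if the whole building loses power.
    """
    all_down = 1.0
    for availability in components:
        all_down *= 1.0 - availability  # probability that this one is down too
    return 1.0 - all_down


if __name__ == "__main__":
    for ratio in (0.99, 0.999, 0.9999):
        print(f"{ratio:.4%} uptime allows {allowed_downtime_minutes(ratio):.1f} min/year of downtime")
    print(f"two independent 99% components: {parallel_availability(0.99, 0.99):.4%}")
```

The independence assumption is the important part: redundancy only raises availability if the redundant parts cannot all be taken out by the same event.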

SPOF: Single Point of Failure (Source)

  • It refers to a component within a system configuration that, if it fails, causes the entire system to stop.
    For example, in a simple Ethernet network system consisting of an Ethernet cable, power, an Ethernet hub (HUB), and the NICs (Network Interface Cards) of connected terminals, the power to the network hub (HUB) device is a SPOF.
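The Ethernet example can be turned into a small sketch: fail each component on its own and see whether the system as a whole still works. Everything here is my own assumption for illustration, including the component names and the rule that the network "works" when the hub is powered and at least two terminals can still talk to each other.

```python
from typing import Callable, Iterable, List, Set


def find_spofs(components: Iterable[str], works: Callable[[Set[str]], bool]) -> List[str]:
    """Return every component whose failure alone stops the whole system."""
    return [c for c in components if not works({c})]


# Hypothetical model of the small Ethernet network described above.
COMPONENTS = ["hub_power", "hub", "nic_a", "nic_b", "nic_c"]


def ethernet_works(failed: Set[str]) -> bool:
    """Assumed rule: the hub must be powered and running, and at least
    two terminals need a working NIC to communicate."""
    if "hub_power" in failed or "hub" in failed:
        return False
    alive_nics = [c for c in COMPONENTS if c.startswith("nic_") and c not in failed]
    return len(alive_nics) >= 2


if __name__ == "__main__":
    print(find_spofs(COMPONENTS, ethernet_works))
```

In this model both the hub's power and the hub itself come out as SPOFs, while losing any single NIC only takes one terminal offline.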

Failover: Fault Recovery Function (Source)

  • It is a function that automatically switches over to a standby system when a fault occurs in a computer server, system, network, or similar. It is also called system replacement operation. A switch performed manually by a person, by contrast, is called switchover.
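The difference between failover and switchover can be sketched roughly as follows. This is a toy model under my own assumptions: the class, the site names "site-a" and "site-b", and the health-check callback are all hypothetical, and real systems also have to deal with data replication, which this ignores.

```python
from typing import Callable


class FailoverRouter:
    """Toy model: traffic goes to one active site, with one standby site."""

    def __init__(self, primary: str, standby: str, is_healthy: Callable[[str], bool]):
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy  # hypothetical health-check callback
        self.active = primary

    def _other(self) -> str:
        """Return the site that is currently not active."""
        return self.standby if self.active == self.primary else self.primary

    def check(self) -> str:
        """Failover: switch automatically when the active site is unhealthy
        and the other site is healthy."""
        if not self.is_healthy(self.active) and self.is_healthy(self._other()):
            self.active = self._other()
        return self.active

    def switchover(self) -> str:
        """Switchover: an operator deliberately swaps the active site."""
        self.active = self._other()
        return self.active


if __name__ == "__main__":
    health = {"site-a": True, "site-b": True}
    router = FailoverRouter("site-a", "site-b", lambda site: health[site])
    health["site-a"] = False  # simulate the primary site going dark
    print(router.check())
```

Note that if the standby site is also unhealthy, `check()` has nowhere to go and traffic stays on the failed site, which is the situation a disaster recovery plan is supposed to prevent.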

DRP: Disaster Recovery Plan (Source)

  • It refers to the processes, policies, and procedures prepared for the purpose of restoring or maintaining critical technical infrastructure for a specific organization in the event of a natural or man-made disaster.
    It involves preparing an action plan in advance for what should be done when a disaster or accident occurs in the facilities' hardware or software. Disaster recovery is also a sub-field of Business Continuity Planning (BCP).

BCP: Business Continuity Planning (Source)

  • It refers to a plan for how a business resumes operations after being hit by a disaster. It is a broader concept that includes Disaster Recovery (DR).
    It identifies the enterprise's core business processes and determines the action plan for handling critical operations.

In other words, nothing is certain, but putting together what is being said on Clien with the terms above...
(From here on, this is not 100% fact but what people said in that community plus my own understanding, so please treat it as reference only.)

It is said that HA and SPOF measures stopped mattering the moment power to the entire building was cut to keep the damage from the IDC fire from spreading. (In that sense, HA and SPOF inside the data center are not the issue in this incident.)

From that point on, failover should have proceeded according to the DR or DRP established together with the BCP. As I understand it, this Kakao outage appears to have happened because that DR did not work properly. (I don't know for sure, but people say it is most likely that DR didn't work; and since server operations are a specialist field, they seem to be saying that the technical DR itself matters more here than the broader BCP?)

It appears that either the Disaster Recovery Plan itself was never properly established, or for some other reason failover and disaster recovery did not work as the backup plan intended, and so the services are not running?

It seems the organization responsible for DR failed to do its job properly (for whatever reason).

Kakao posted a notice a little while ago, and judging from its content, the speculation above seems to be right. The recovery plan doesn't seem to be working properly. And even from a rough reading, it looks like mending the barn after the cow is already lost. Tsk.

https://www.daum.net/notice

Regardless, Kakao is now essentially a symbiotic service that is linked to almost all services in Korea, and in a sense, has become a national service. Therefore, I hope this issue is resolved well and that proper plans or countermeasures are in place.

Of course, after writing this and studying further, I realized that this is quite a complex and difficult problem. Even with the technology and a backup plan in place, so many services are interconnected with Kakao that 100% consistency between the backup data and the live data cannot be guaranteed. It seems it would have been very hard to decide how to respond from the moment the fire broke out.

After studying and trying to understand the situation, it seems like a problem that is hard even to think about.
I wonder how the results will be presented later.