

GitLab, from database deletion to recovery: six hours of production data lost. Should the operator be fired?

Posted by forbes at 2020-03-15

February 1 is a day that tens of thousands of ops engineers will not forget. While we in China were welcoming the God of Wealth on the fifth day of the Lunar New Year, an ops engineer at GitLab.com on the other side of the ocean ran into a devil known as sudo rm -rf.

GitLab is the familiar open source Git code hosting tool; many Chinese companies use the Community Edition to build private internal code hosting platforms. GitLab.com also offers hosted code hosting and continuous integration as cloud services, roughly GitHub plus Travis CI combined. In 2016, GitLab closed a US $20 million Series B round.

The good news: the GitLab database was finally recovered at 00:14 on February 2, Beijing time (18:14 UTC on February 1).

The bad news: six hours of production data were lost (everything written between 17:20 UTC and 23:25 UTC on January 31). The affected data is what is stored in the database, such as projects, users, issues, and merge requests; code and wiki data are stored separately on the file system and were not affected.

GitLab says this is unacceptable and will use the "5 Whys" technique to analyze the incident in depth and prevent a recurrence.

To its credit, GitLab live-streamed the recovery process. The replay is available at the address below:


A brief review of the failure and its recovery:

GitLab.com came under spam attack, and the chain reaction caused PostgreSQL replication problems;

An extremely tired on-call engineer (engineer A) made a mistake: he meant to run rm -rf on the problematic database server but ran it on the healthy one instead (he noticed a second or two later, but by then only 4.5 GB of the 300 GB of data remained);

Fortunately, engineer A had taken a manual LVM snapshot before the failure; otherwise recovery would have meant going back a full 24 hours...

What follows is the full account of the incident, compiled from GitLab's official write-up by @Longjing, a senior editor of the efficient-ops community:


This incident affected the PostgreSQL database shown in the figure below (the affected data is what is stored in the database, such as projects, users, issues, and merge requests; code and wiki data are stored separately on the file system and were not affected).

Stage one

At 18:00 UTC on January 31, 2017, we noticed spammers creating snippets (a snippet is a GitLab code fragment that can be shared with others for viewing and collaboration), which made the database unstable. We immediately began analyzing the cause and considering solutions.

At 21:00 UTC on January 31, 2017, the spam left the database unable to handle writes, causing downtime.

Measures taken:

We blocked the spammers' IP addresses;

We removed a user who was using a repository as a kind of CDN: 47,000 IP addresses were logging in through that one account, creating a very high database load;

We removed the users who were sending spam by creating snippets.

Stage two

At 22:00 UTC on January 31, 2017, we were alerted that database replication was lagging badly. The cause was a spike of writes that the secondary had not processed in time.

Measures taken:

Tried to fix db2, which was about 4 GB behind at that point;

db2.cluster refused to replicate; the /var/opt/gitlab/postgresql/data directory was wiped to allow a clean replication;

db2.cluster then refused to connect to db1, complaining that max_wal_senders was too low. This parameter limits the number of WAL (replication) clients;

Engineer A raised max_wal_senders on db1 to 32 and restarted PostgreSQL;

PostgreSQL complained that too many semaphores were open and refused to start;

Engineer A lowered max_connections from 8000 to 2000, and PostgreSQL restarted successfully (even though 8000 had been in use for nearly a year);

db2.cluster still refused to replicate; this time there was no connection error, it simply hung.
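The two parameter changes above boil down to edits in postgresql.conf followed by a restart. A minimal sketch, using the values named in the post (the comments are my reading of the failure, not GitLab's exact configuration):

```conf
# postgresql.conf fragment -- both settings require a PostgreSQL restart.
max_wal_senders = 32     # cap on WAL (replication) clients; too low for db2 to connect
max_connections = 2000   # lowered from 8000; PostgreSQL allocates semaphores in
                         # proportion to this, and at 8000 it failed to start
```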

By this point everyone was quite dispirited. Earlier in the evening, engineer A had said he was going to log off because it was very late, but the replication problem suddenly escalated and he had to stay on.

Stage three

At 23:00 UTC on January 31, 2017, engineer A suspected that pg_basebackup was refusing to work because the PostgreSQL data directory existed (even though it was empty), and decided that deleting the directory would fix it.

A second or two later, he suddenly realized he had run the delete command on the healthy db1.cluster.gitlab.com rather than on the problematic db2.cluster.gitlab.com.

At 23:27 UTC on January 31, 2017, engineer A killed the deletion, but it was too late: of the original 300 GB of data, only about 4.5 GB remained.

We had to take gitlab.com offline for the time being and posted on Twitter:

We are in the process of database emergency maintenance

- gitlab.com status (@gitlabstatus) January 31, 2017

Problems encountered

LVM snapshots are normally taken every 24 hours. As luck would have it, engineer A had run one manually about six hours before the outage while working on load balancing for the database.
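A manual LVM snapshot like the one that saved the day is a single lvcreate invocation. The sketch below builds the command as a string so it can be inspected before running with root privileges; the volume group, logical volume name, and copy-on-write buffer size are invented for illustration, since the post does not describe GitLab's volume layout:

```shell
# Sketch: assemble the lvcreate command for a copy-on-write snapshot.
# VG/LV names and the buffer size are illustrative assumptions.
lvm_snapshot_cmd() {
  local vg="$1" lv="$2" cow_size="$3"
  echo "lvcreate --snapshot --size ${cow_size} --name ${lv}-snap /dev/${vg}/${lv}"
}
```

For example, `lvm_snapshot_cmd data postgresql 10G` prints the command; running its output (as root, on a host where those volumes exist) takes the snapshot.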

Regular (daily) backups also seem to run only every 24 hours, but engineer A could not find where they were stored. According to engineer B, they were not working: the backup files produced were only a few bytes in size.

Editor's note: the daily backup here refers to the gitlab-rake gitlab:backup:create task that GitLab ships. Run from crontab once a day, it packages up the database, repositories, and uploaded files.
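Such a daily backup is typically wired up through crontab. An illustrative entry is shown below; the schedule and the gitlab-rake path are assumptions on my part, only the task name comes from the post:

```conf
# Illustrative crontab entry: run GitLab's bundled backup task once a day at 02:00.
# The time and the /opt/gitlab/bin path are assumed, not taken from the post.
0 2 * * * /opt/gitlab/bin/gitlab-rake gitlab:backup:create CRON=1
```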

Engineer C: pg_dump appears to have been failing as well, because it was running the PostgreSQL 9.2 binaries instead of the 9.6 binaries.

This happened because GitLab Omnibus only uses PostgreSQL 9.6 when the data/PG_VERSION file is set to 9.6; on the worker nodes this file did not exist, so 9.2 ran by default and failed quietly. As a result, there were no SQL dumps.
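The version-selection behavior described above can be sketched as a tiny shell function. This mirrors the failure mode as the post describes it; it is not Omnibus's actual code:

```shell
# Sketch of the failure mode: pick the PostgreSQL binary version from
# data/PG_VERSION, falling back silently when the file is absent.
pg_binary_version() {
  local data_dir="$1"
  if [ -f "${data_dir}/PG_VERSION" ]; then
    cat "${data_dir}/PG_VERSION"   # e.g. "9.6" on a properly initialized node
  else
    echo "9.2"                     # the silent default that broke pg_dump on workers
  fi
}
```

On the worker nodes the file was missing, so logic like this returned 9.2, and the 9.2 pg_dump quietly failed against the 9.6 cluster.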

The Fog gem (a Ruby cloud-services library) may have pruned the earlier backups.

On Azure, disk snapshots were enabled for the NFS servers but not for the database servers.

Once the process that synchronizes data to the staging environment runs, it deletes all webhooks (GitLab provides webhooks for callbacks and event triggering). Unless we can retrieve them from a regular backup taken in the past 24 hours, the webhooks will be lost.

The replication procedure is fragile and error-prone, relies on a few hand-written shell scripts, and lacks proper documentation.

The backup to S3 also failed.

In other words: all five of our backup/replication techniques turned out to be useless, and in the end we could only restore from a backup taken six hours earlier.

pg_basebackup silently waits for the primary to initiate the replication process; according to another production engineer this can take up to 10 minutes, which led us to believe the process was stuck somewhere. Tracing the process yielded no useful information.
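For reference, a typical pg_basebackup invocation for seeding a replica looks like the sketch below. The host name and option set are illustrative, not GitLab's exact command, since the post only names the tool:

```shell
# Sketch: assemble a pg_basebackup command that seeds a replica from the primary.
# Host and flags are illustrative assumptions.
basebackup_cmd() {
  local primary_host="$1" data_dir="$2"
  echo "pg_basebackup -h ${primary_host} -D ${data_dir} -X stream -P"
}
```

-X stream streams WAL alongside the base backup, and -P prints progress, which would at least have made the silent ten-minute wait visible.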


We are using a database backup from the staging environment to recover the data.

We accidentally deleted production data and might have to restore from backup. Live notes are in a Google Doc: https://t.co/evrbhzylk8

- gitlab.com status (@gitlabstatus) February 1, 2017

2017/02/01 00:36 - backed up the data on db1.staging.gitlab.com

2017/02/01 00:55 - mounted db1.staging.gitlab.com on db1.cluster.gitlab.com

Copied the data from the staging /var/opt/gitlab/postgresql/data/ to the production /var/opt/gitlab/postgresql/data/

2017/02/01 01:05 - the nfs-share01 server was commandeered as temporary storage at /var/opt/gitlab/db-meltdown

2017/02/01 01:18 - copied the remaining production data, including the pg_xlog directory, packed as 20170131-db-meltodwn-backup.tar.gz

The following figure shows the time required to delete and then copy data:

In the early hours of February 2, Beijing time, gitlab.com engineers also streamed the repair process on YouTube while answering questions from viewers. Limited by disk read/write speed, the restore was very slow, and during the stream they even asked viewers for ways to speed it up.

That is how gitlab.com announced and worked through the incident.

I imagine plenty of developers abroad were forced to take the day off: no pulling or pushing code, no continuous-integration runs, let alone production deployments. With all five layers of backups failing, netizens suggested that February 1 be designated "World Backup Day" to commemorate the event and warn those who come after.

Lessons from this incident:

Don't operate while exhausted, any more than you would drive tired or drunk; above all, don't touch a database in that state;

Set an alias for the rm command; a common practice is to alias it to mv files into a designated trash directory;

Backups are only as good as the restores: run recovery drills from the backup data regularly, to verify that the backups are complete and usable and that the recovery plan actually works;

Practice the blameless culture of DevOps, especially in incident analysis: focus on locating causes and formulating improvements;

When handling an incident, consider whether each measure could trigger cascading failures, and think twice before every critical operation;

Contingency plans still need to be made. The response and repair cycle of this incident was very long; slow spare hardware is bearable, but data loss is unacceptable to users.

Don't respond by adding a management-approval step for online operations: it accomplishes nothing and hurts efficiency;
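The "alias rm to mv" advice above can be sketched as a small wrapper that moves its targets into a trash directory instead of deleting them. The TRASH_DIR variable and the layout are my own choices, not a standard:

```shell
# Sketch of the "alias rm to mv" advice: move targets into a trash directory
# instead of deleting them. TRASH_DIR and the layout are illustrative choices.
safe_rm() {
  local trash="${TRASH_DIR:-$HOME/.trash}"
  mkdir -p "$trash" || return 1
  mv -- "$@" "$trash"/    # recoverable until the trash is purged
}
# e.g. in ~/.bashrc:  alias rm='safe_rm'
```

This is only a safety net for interactive shells (scripts that call /bin/rm directly bypass aliases), and the trash directory still needs periodic purging.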

And hear it from the ops community itself:

During the live broadcast, gitlab staff made it clear that they would not fire the employee.

So, what do you think? Cast your vote!

2017 is shaping up to be an eventful year. GOPS 2017 Shenzhen, to be held April 21-22, will open a dedicated deep-dive session on failures and how to avoid them, and will publish the "Thirty-Six Stratagems of Ops", so that low-level mistakes are not repeated and other people's experience becomes your own asset:

"Thirty-Six Stratagems of Ops": which one do you like best?

GOPS 2017 Global Ops Conference, Shenzhen station, officially sets sail

For a more professional analysis of this major failure, see the article by @Deger (click the "Read the original" link).

What do you have to say? Do you think this is grounds for dismissal? Please leave a comment at the end of the article.