Outage Postmortem

May 22, 2021

Summary of the Issue
Company X had an outage on Cloud X from 10:30 p.m. to 8:30 a.m., affecting www.brandixitordev.games, owing to a failed pre-scheduled reboot. The failure point was the reboot of the load balancer or cache server.

On May 10, 2020, the Security Team released a maintenance update and a reboot warning for a 4-hour window (10:30 p.m. to 2:30 a.m.) on a large Cloud X server. Program X has a security flaw that the Security Team is currently fixing.

Customers who attempted to access the site between the scheduled end of the reboot window at 2:30 a.m. and 8:30 a.m. were unable to do so; 100% of them were affected. The primary cause of the outage was that the data on the server did not auto-mount at startup. During the server reset, the SRE mistook the alerts for routine reboot behavior and silenced all future alarms.

Timetable (all times in Pacific Time)

10:30 p.m.: Reboot begins.

12:26 a.m.: The outage begins.

12:26 a.m.: Teams are notified.

12:30 a.m.: The lead SRE silences all future alarms, treating the alerts as “normal” server behavior.

2:30 a.m.: The server is down, and customers are unable to access the website.

2:30 a.m. (approximately): Teams are notified.

2:35 a.m.: A ticket is opened, and a database service provider worker is dispatched within 30 minutes.

3:15 a.m. (approximately): Personnel arrive to examine the situation.

4:30 a.m.: Server restarts commence.

8:30 a.m.: All traffic is back up.

The Source of the Problem

The reboot of the load balancer or cache server failed at 12:26 a.m. For three hours, the failover (backup) server was unavailable. We did not expect the scheduled reset to affect our systems. Although the software upgrade was successful, it degraded the site’s overall performance, and the secondary server was not reconnected. According to database service provider personnel, the data on the server did not auto-mount at startup. Our lead SRE attributed the alerts to a routine server reboot and therefore immediately silenced all further notifications.
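
One way to keep a planned reboot from muting real failures is to give every silence an explicit end time instead of dismissing all future alarms. The following is a minimal sketch of that idea, not the alerting system we actually run: the `Silence` class and `should_page` function are hypothetical names, and the calendar dates are assumed because this postmortem records only clock times.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Silence:
    """A time-boxed alert silence (hypothetical structure, not a real tool's API)."""
    starts_at: datetime
    ends_at: datetime

    def suppresses(self, alert_time: datetime) -> bool:
        # Only alerts raised inside the window are muted; the end is exclusive.
        return self.starts_at <= alert_time < self.ends_at


def should_page(alert_time: datetime, silences: list[Silence]) -> bool:
    """Return True when the alert must still reach a human."""
    return not any(s.suppresses(alert_time) for s in silences)


# Assumed dates for illustration: a 4-hour window starting at 10:30 p.m.
window_start = datetime(2020, 5, 10, 22, 30)
window = Silence(window_start, window_start + timedelta(hours=4))

during_reboot = datetime(2020, 5, 11, 0, 26)  # 12:26 a.m., inside the window
after_window = datetime(2020, 5, 11, 2, 30)   # 2:30 a.m., the window has ended
```

With a silence like this, the 12:26 a.m. alert would still have been muted, but the 2:30 a.m. alert would have paged someone, because the silence expires when the maintenance window does.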

Recovery and Resolution

A ticket was created at 2:35 a.m. to fix the down server and was submitted to the database service provider staff, who arrived at 3:15 a.m.
The NSX CLI was used to gather detailed information from tail logs and to perform packet captures, and metrics were examined to troubleshoot the load balancer.

Basic services were verified by reviewing the routing table. Because the auto-mount had failed, the database service provider staff manually added a new mount at 4:30 a.m.
The processes then slowly began to recover. We estimated that rebooting and recovery would take 4 hours.
We rebooted the impacted servers to aid recovery. As a further fix, SRE engineers manually restarted the unicorn processes on the web application servers.
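
A failed auto-mount like the one in this incident can be detected right after boot by comparing the mounted filesystems against the ones the server needs. The sketch below assumes a Linux host and the `/proc/mounts` text format (device, mount point, filesystem type, options, and two numeric fields per line). The function names and the `/var/lib/data` path are hypothetical, chosen only for illustration.

```python
def missing_mounts(mounts_text: str, required: list[str]) -> list[str]:
    """Return required mount points absent from a /proc/mounts-style listing."""
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mount_point fs_type options dump pass
        if len(fields) >= 2:
            mounted.add(fields[1])
    return [path for path in required if path not in mounted]


def read_proc_mounts(path: str = "/proc/mounts") -> str:
    """Read the kernel's mount table (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Illustrative listing in which the data volume never came up after boot.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

# "/var/lib/data" is a hypothetical mount point for the database volume.
MISSING = missing_mounts(SAMPLE, ["/", "/var/lib/data"])
```

On a real host, `missing_mounts(read_proc_mounts(), [...])` run as a post-boot health check could raise an alert that is clearly distinct from routine reboot noise.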

The procedure was deliberately gradual to avoid cascading failures and a large-scale reboot, which would have impacted our customers even more.
By 8:30 a.m., traffic had been restored, and our clients reported no problems.
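
The gradual procedure described above can be sketched as a rolling restart that handles a few servers at a time and stops as soon as a batch fails its health check. This is an illustration under stated assumptions, not the tooling we used: the `restart` and `healthy` callables stand in for whatever actually restarts a host and probes it, and the host names are made up.

```python
from typing import Callable, Iterator


def batches(hosts: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


def rolling_restart(
    hosts: list[str],
    size: int,
    restart: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> list[str]:
    """Restart hosts batch by batch and return the hosts that were restarted.

    Stops after the first batch containing an unhealthy host, so a bad
    restart cannot cascade across the whole fleet.
    """
    restarted: list[str] = []
    for batch in batches(hosts, size):
        for host in batch:
            restart(host)
            restarted.append(host)
        if not all(healthy(host) for host in batch):
            break
    return restarted


# Hypothetical fleet of five web servers, restarted two at a time.
FLEET = ["web1", "web2", "web3", "web4", "web5"]
```

Keeping the batch size small trades recovery speed for safety, which matches the choice we made during this recovery.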

Preventative and Corrective Measures

Over the past three days, we conducted an internal review and analysis of the outage. The team will take the actions listed below to prevent future incidents and improve response times.
During the performance degradation, the database service provider could have alerted teams that the down server was not normal behavior. Their systems will be upgraded as needed to prevent future problems.
Many of our clients were affected because the Security Team did not adequately plan the reboot schedule. No one was available to monitor the reboot and respond to the problem as soon as it occurred, and no engineers were present to resolve issues manually as they came up. For scheduled reboots like this, there should be a 24-hour operations rotation to monitor every aspect of the update.
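
Before a scheduled reboot, the rotation can be checked against the maintenance period to confirm that someone is on duty for all of it. The sketch below is one possible way to find uncovered gaps. The `coverage_gaps` function and the example shift times are assumptions for illustration and do not describe our real schedule.

```python
from datetime import datetime

Interval = tuple[datetime, datetime]


def coverage_gaps(window: Interval, shifts: list[Interval]) -> list[Interval]:
    """Return the parts of `window` that no shift covers."""
    start, end = window
    gaps: list[Interval] = []
    cursor = start  # everything before `cursor` is already covered
    for shift_start, shift_end in sorted(shifts):
        if shift_end <= cursor:
            continue  # shift ends before the uncovered part begins
        if shift_start >= end:
            break  # shift starts after the window is over
        if shift_start > cursor:
            gaps.append((cursor, shift_start))
        cursor = max(cursor, shift_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Assumed dates: reboot start at 10:30 p.m. through full recovery at 8:30 a.m.
WINDOW = (datetime(2020, 5, 10, 22, 30), datetime(2020, 5, 11, 8, 30))

# Hypothetical single shift that ends when the planned window ends.
EVENING_SHIFT = (datetime(2020, 5, 10, 22, 0), datetime(2020, 5, 11, 2, 30))
```

Here `coverage_gaps(WINDOW, [EVENING_SHIFT])` reports the stretch from 2:30 a.m. to 8:30 a.m. as uncovered, which is the period in which customers could not reach the site.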

Written by Khalil Hassayoun