Write-up
Major outage
Overview

Rollbar experienced a major outage caused by human error.

What Happened

While validating the health of the primary database, an engineer accidentally triggered the installation of the earlyoom package. Because memory utilization on the primary database server was high, earlyoom killed processes until the erroneous package was removed.
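The interaction described above can be illustrated with a simplified model of the kind of check a userspace out-of-memory killer such as earlyoom performs. This is a sketch under assumptions, not earlyoom's actual code: the function names are hypothetical, the 10% thresholds are assumed to mirror earlyoom's documented defaults, and the memory figures in the usage lines are made up for illustration rather than taken from the incident.

```python
def free_percent(free_kib: int, total_kib: int) -> float:
    """Percentage of a resource still free; 0.0 when the resource is absent."""
    if total_kib <= 0:
        # Assumption: a host without swap is treated as having 0% swap free.
        return 0.0
    return 100.0 * free_kib / total_kib


def would_trigger(
    mem_available_kib: int,
    mem_total_kib: int,
    swap_free_kib: int,
    swap_total_kib: int,
    mem_min_pct: float = 10.0,   # assumed default threshold
    swap_min_pct: float = 10.0,  # assumed default threshold
) -> bool:
    """Return True when both available memory and free swap are at or below
    their minimum percentages, which is when a kill would be attempted."""
    mem_pct = free_percent(mem_available_kib, mem_total_kib)
    swap_pct = free_percent(swap_free_kib, swap_total_kib)
    return mem_pct <= mem_min_pct and swap_pct <= swap_min_pct


if __name__ == "__main__":
    GIB = 1024 * 1024  # KiB per GiB
    # Hypothetical database host: 64 GiB RAM, no swap, 4 GiB available.
    print(would_trigger(4 * GIB, 64 * GIB, 0, 0))
    # Same host with half of its memory available.
    print(would_trigger(32 * GIB, 64 * GIB, 0, 0))
```

A database server that deliberately keeps most of its memory in use can sit below such a threshold during normal operation, which is why installing a tool of this kind on it has immediate effect.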

Impact

The Web UI was unavailable to all users, and data was not processed, for approximately 30 minutes.

Resolution

Manual failover procedures were initiated to switch to the existing replica, and services were then restarted.

Timeline
  • September 30, 1:03 AM (PDT): The apt package earlyoom was accidentally installed and started.

  • September 30, 1:08 AM (PDT): The package was removed and mysqld was restarted on the primary database instance, where it entered crash recovery; mysqld was then halted on the primary instance to prevent split-brain issues.

  • September 30, 1:13 AM (PDT): Failover to the secondary database instance was initiated.

  • September 30, 1:28 AM (PDT): The pipeline recovered and the incident was closed.
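
As a quick cross-check of the timeline, the following minimal sketch computes how many minutes elapsed between the first entry and each later one. The timestamps are copied from the entries above; the event labels are shortened paraphrases, and the names `EVENTS` and `minutes_since_start` are hypothetical helpers introduced only for this illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on September 30, PDT).
EVENTS = [
    ("1:03 AM", "earlyoom installed and started"),
    ("1:08 AM", "package removed; mysqld restarted, then halted"),
    ("1:13 AM", "failover to secondary initiated"),
    ("1:28 AM", "pipeline recovered; incident closed"),
]


def minutes_since_start(events):
    """Return (minutes_elapsed, label) pairs relative to the first event."""
    # All events fall on the same day, so parsing the time of day is enough.
    times = [datetime.strptime(stamp, "%I:%M %p") for stamp, _ in events]
    start = times[0]
    return [
        (int((t - start).total_seconds() // 60), label)
        for t, (_, label) in zip(times, events)
    ]


if __name__ == "__main__":
    for minutes, label in minutes_since_start(EVENTS):
        print(f"+{minutes:>2} min  {label}")
```

The listed entries span 25 minutes from the accidental installation to incident closure, which is in line with the approximate duration given under Impact.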