Write-up published
Resolved
Rollbar experienced a major outage caused by human error.
While validating the health of the primary database, an engineer accidentally triggered the installation of the earlyoom package. Because memory utilization on the primary database server was high, earlyoom killed processes until the erroneous package was removed.
The Web UI became unavailable to all users and data was not processed for approximately 30 minutes.
Manual failover procedures were initiated to switch to the existing replica and services were restarted.
September 30, 1:03 AM (PDT): Accidental installation and initialization of apt package earlyoom.
September 30, 1:08 AM (PDT): Package was removed and mysqld restarted on the primary database instance, entering crash recovery; mysqld was then halted on the primary instance to prevent split-brain issues.
September 30, 1:13 AM (PDT): Failover to secondary database instance initiated.
September 30, 1:28 AM (PDT): Pipeline recovered and incident was closed.
Resolved
The incident has been resolved. The pipeline has a backlog of approximately 30 minutes, which is expected to clear within minutes.
Monitoring
A fix has been implemented and we are monitoring the results.
Identified
The issue has been identified and a fix is being implemented.
Investigating
We are currently investigating this issue.