Major outage
Incident Report for Rollbar
Postmortem

Overview

Rollbar experienced a major outage caused by human error.

What Happened

While validating the health of the primary database, an engineer accidentally triggered the installation of the earlyoom package. Because memory utilization on the primary database server was high, earlyoom killed processes until the erroneous package was removed.
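
To illustrate why a low-memory killer can be dangerous on a database host that normally runs with high memory utilization, here is a minimal sketch of a threshold-based decision. It is not earlyoom's code. The 10% and 5% thresholds are assumptions modeled on earlyoom's documented defaults, and the names `MemoryStatus` and `oom_action`, along with the sample readings, are hypothetical and not taken from the incident.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed thresholds, modeled on earlyoom's documented defaults.
# Real behavior depends on the installed version and its configuration.
TERM_PERCENT = 10.0  # at or below this, ask a process to exit (SIGTERM)
KILL_PERCENT = 5.0   # at or below this, force a process to exit (SIGKILL)


@dataclass
class MemoryStatus:
    """Hypothetical snapshot of memory pressure on a host."""
    mem_available_percent: float
    swap_free_percent: float  # modeled as 0.0 on a host without swap


def oom_action(status: MemoryStatus,
               term: float = TERM_PERCENT,
               kill: float = KILL_PERCENT) -> Optional[str]:
    """Return the signal a threshold-based killer would send, or None."""
    mem = status.mem_available_percent
    swap = status.swap_free_percent
    # Both available memory and free swap must be under the threshold.
    if mem <= kill and swap <= kill:
        return "SIGKILL"
    if mem <= term and swap <= term:
        return "SIGTERM"
    return None


# Hypothetical readings: a database host that keeps most memory in use.
busy_db_host = MemoryStatus(mem_available_percent=4.0, swap_free_percent=0.0)
idle_host = MemoryStatus(mem_available_percent=40.0, swap_free_percent=0.0)

print(oom_action(busy_db_host))
print(oom_action(idle_host))
```

Under these assumptions, a host that is healthy but deliberately memory-hungry looks the same to the killer as a host that is about to run out of memory, which is consistent with processes being killed as soon as the package was initialized.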

Impact

The Web UI was unavailable to all users and data was not processed for approximately 30 minutes.

Resolution

Manual failover procedures were initiated to switch to the existing replica, and services were restarted.

Timeline

  • September 30, 1:03 AM (PDT): Accidental installation and initialization of apt package earlyoom.
  • September 30, 1:08 AM (PDT): Package was removed and mysqld restarted on the primary database instance, entering crash recovery; mysqld was then halted on the primary instance to prevent split-brain issues.
  • September 30, 1:13 AM (PDT): Failover to secondary database instance initiated.
  • September 30, 1:28 AM (PDT): Pipeline recovered and incident was closed.
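
The intervals between the timeline entries above can be worked out with a short sketch. The timestamps are copied from the timeline; the event labels and the helper name `minutes_since_start` are illustrative only.

```python
from datetime import datetime

# Times (PDT, September 30) copied from the timeline; labels are shorthand.
timeline = [
    ("earlyoom installed", "01:03"),
    ("package removed, mysqld halted", "01:08"),
    ("failover initiated", "01:13"),
    ("pipeline recovered", "01:28"),
]


def minutes_since_start(events):
    """Return (label, minutes elapsed since the first event) pairs."""
    start = datetime.strptime(events[0][1], "%H:%M")
    result = []
    for label, clock in events:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


for label, minutes in minutes_since_start(timeline):
    print(f"+{minutes:>2} min  {label}")
```

This gives 5 minutes from the accidental installation to package removal, 10 minutes to the start of failover, and 25 minutes to pipeline recovery, roughly in line with the approximately 30 minutes of impact reported.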
Posted Oct 06, 2020 - 13:04 PDT

Resolved
The incident has been resolved. The pipeline has a backlog of approximately 30 minutes that is expected to catch up in minutes.
Posted Sep 30, 2020 - 01:28 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 30, 2020 - 01:25 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 30, 2020 - 01:22 PDT
Investigating
We are currently investigating this issue.
Posted Sep 30, 2020 - 01:12 PDT
This incident affected: Web App (rollbar.com) and Processing Pipeline.