Updates

Write-up published

Resolved

Overview

Rollbar experienced a major outage caused by human error.

What Happened

While validating the health of the primary database, an engineer accidentally triggered the installation of the earlyoom package. Because memory utilization on the primary database server was high, earlyoom killed processes until the erroneous package was removed.
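
The mechanism can be illustrated with a minimal sketch of the kind of check a memory watchdog such as earlyoom performs. This is not earlyoom's actual code. The function names, the sample figures, and the 10% thresholds are assumptions made for illustration (the thresholds are modeled on earlyoom's documented defaults); the configuration on the affected server is not stated in this write-up.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def percent(part, total):
    """Return part as a percentage of total; a missing resource counts as 0%."""
    return 100.0 * part / total if total else 0.0


def would_trigger(meminfo, mem_min_pct=10.0, swap_min_pct=10.0):
    """Return True when available memory AND free swap are both at or below
    their thresholds. The threshold values are assumptions for illustration."""
    mem_avail = percent(meminfo["MemAvailable"], meminfo["MemTotal"])
    swap_free = percent(meminfo.get("SwapFree", 0), meminfo.get("SwapTotal", 0))
    return mem_avail <= mem_min_pct and swap_free <= swap_min_pct


# Hypothetical snapshot of a database host whose memory is almost fully
# committed and which has no swap configured.
SAMPLE = """\
MemTotal:       65536000 kB
MemAvailable:    3276800 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""

if __name__ == "__main__":
    print(would_trigger(parse_meminfo(SAMPLE)))
```

A database server that dedicates most of its RAM to the database process can sit below such a threshold during normal operation, which is why a watchdog of this kind may start killing processes as soon as it is started on such a host.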

Impact

The Web UI became unavailable to all users, and data was not processed for approximately 30 minutes.

Resolution

Manual failover procedures were initiated to switch to the existing replica, and services were restarted.

Timeline
  • September 30, 1:03 AM (PDT): Accidental installation and initialization of the apt package earlyoom.

  • September 30, 1:08 AM (PDT): The package was removed and mysqld was restarted on the primary database instance, where it entered crash recovery; mysqld was then halted on the primary instance to prevent split-brain issues.

  • September 30, 1:13 AM (PDT): Failover to the secondary database instance was initiated.

  • September 30, 1:28 AM (PDT): The pipeline recovered and the incident was closed.
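
The elapsed time between the timeline entries can be worked out with a short sketch. The event labels are paraphrased from the timeline above, and the helper name `minutes_since_start` is hypothetical; only the clock times come from this write-up.

```python
from datetime import datetime

# Clock times (PDT) taken from the timeline; labels are paraphrased.
EVENTS = [
    ("earlyoom installed", "01:03"),
    ("package removed, mysqld restarted then halted", "01:08"),
    ("failover to secondary initiated", "01:13"),
    ("pipeline recovered, incident closed", "01:28"),
]


def minutes_since_start(events):
    """Return (label, minutes elapsed since the first event) for each event."""
    start = datetime.strptime(events[0][1], "%H:%M")
    result = []
    for label, clock in events:
        elapsed = datetime.strptime(clock, "%H:%M") - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


if __name__ == "__main__":
    for label, minutes in minutes_since_start(EVENTS):
        print(f"+{minutes:2d} min  {label}")
```

By this count, 25 minutes passed between the accidental installation and pipeline recovery, with the failover beginning 10 minutes in.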

Tue, Oct 6, 2020, 07:56 PM

Resolved

The incident has been resolved. The pipeline has a backlog of approximately 30 minutes that is expected to catch up in minutes.

Wed, Sep 30, 2020, 08:28 AM

Monitoring

A fix has been implemented and we are monitoring the results.

Wed, Sep 30, 2020, 08:25 AM

Identified

The issue has been identified and a fix is being implemented.

Wed, Sep 30, 2020, 08:22 AM

Investigating

We are currently investigating this issue.

Wed, Sep 30, 2020, 08:12 AM