Major outage
Incident Report for Rollbar
Postmortem

Overview

Rollbar experienced a major outage caused by a service interruption in one of our main data stores. The outage started at 1:40 AM PDT on May 6th and ended with all components fully restored by 2:33 AM PDT.

During this timeframe, service was interrupted in two ways: for approximately 10 minutes, we were unable to accept all incoming data into our pipeline, and for approximately 14 minutes, a small number of notifications were sent to incorrect recipients.

What Happened

Starting at 1:40 AM PDT on May 6th, 2020, Rollbar experienced a sudden and unexpected failure of a core data store, which affected several components of our platform. Our team began bringing a hot replica online. The data store self-recovered before the failover to the replica was complete.

While our team was executing the failover to the hot replica, there was a short window (from 2:19 AM to 2:33 AM) of data inconsistency between the two systems, which caused notifications to be sent to the wrong recipients for a subset of our customers. The incorrect content in these notifications was limited to the Item Title, Project Name, and Account Name. No other information was impacted. A separate security response has been initiated with the impacted customers to address this specific aspect of the incident.

Impact

The system was unable to accept incoming traffic for about 10 minutes of the incident. Apart from that window, there was no data loss.

After that, the Processing Pipeline was running and processing events in real time, with a potential delay for previously sent events until 3:27 AM PDT.

A small number of accounts were impacted by notifications triggered by the data inconsistency between 2:19 AM and 2:33 AM, in which the Item Title, Project Name, and Account Name reached the wrong recipient. All affected customers have already been notified directly with the specific data that may have been accidentally exposed.

These are the impacted services:

  • API Tier and Data Ingestion: [1:22 AM - 1:45 AM PDT] - Degraded Performance
  • Web Application: [1:22 AM - 2:23 AM PDT]
  • Processing Pipeline: [1:22 AM - 2:33 AM PDT]
  • Notifications: [1:22 AM - 3:27 AM PDT]
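The length of each impact window can be derived directly from the times listed above. The following is a minimal sketch in Python, assuming all times fall on the same day (May 6th, 2020, PDT); the `windows` dictionary and the `minutes_between` helper are illustrative names, not part of any Rollbar tooling.

```python
from datetime import datetime

# Impact windows copied from the list above (all times PDT, same day).
windows = {
    "API Tier and Data Ingestion": ("1:22 AM", "1:45 AM"),
    "Web Application": ("1:22 AM", "2:23 AM"),
    "Processing Pipeline": ("1:22 AM", "2:33 AM"),
    "Notifications": ("1:22 AM", "3:27 AM"),
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day clock times like '1:22 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = {name: minutes_between(s, e) for name, (s, e) in windows.items()}
for name, minutes in durations.items():
    print(f"{name}: {minutes} minutes")
```

Under these assumptions the windows come to 23, 61, 71, and 125 minutes respectively.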

Resolution

As part of our standard runbook, Rollbar triggered an immediate promotion of one of the hot replicas prepared for the specific scenario of a database failure. The affected data store was able to self-recover without intervention before the operations to bring the replica online were concluded.

The data inconsistency between the replica and the original data store caused a few systems to briefly run in a split-brain state, where some notifications were incorrectly triggered.
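As a generic illustration of how a split-brain state can be guarded against (not a reconstruction of this incident or of Rollbar's systems), one common technique is an epoch, sometimes called a fencing token: every failover increments a counter, and any store handle carrying an older epoch is rejected. All class and method names in this Python sketch (`Cluster`, `StoreHandle`, `promote_replica`) are hypothetical.

```python
class StaleStoreError(Exception):
    """Raised when a store that is no longer authoritative is used."""


class Cluster:
    """Tracks which store is authoritative via an increasing epoch."""

    def __init__(self) -> None:
        self.epoch = 1

    def promote_replica(self) -> int:
        # Each failover bumps the epoch; handles issued earlier become stale.
        self.epoch += 1
        return self.epoch


class StoreHandle:
    """A handle to one data store, tagged with the epoch it was issued in."""

    def __init__(self, name: str, cluster: Cluster, epoch: int) -> None:
        self.name = name
        self.cluster = cluster
        self.epoch = epoch
        self.rows = {}

    def _check(self) -> None:
        # Refuse to serve reads or writes from a store with an old epoch.
        if self.epoch != self.cluster.epoch:
            raise StaleStoreError(
                f"{self.name} has epoch {self.epoch}, "
                f"current is {self.cluster.epoch}"
            )

    def write(self, key, value) -> None:
        self._check()
        self.rows[key] = value

    def read(self, key):
        self._check()
        return self.rows[key]


cluster = Cluster()
primary = StoreHandle("primary", cluster, cluster.epoch)
primary.write(1, "item title")

# Failover: the replica is promoted and receives the new epoch.
replica = StoreHandle("replica", cluster, cluster.promote_replica())
replica.write(1, "item title")

try:
    primary.read(1)  # the old primary came back, but its epoch is stale
    stale_rejected = False
except StaleStoreError:
    stale_rejected = True

print(stale_rejected)
```

With a guard of this kind, a recovered original store cannot be read alongside a promoted replica, so the two can never both be treated as authoritative at once.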

We promptly resolved the inconsistency and proactively notified the affected customers about the incorrect notifications. All affected customers were notified within 48 hours of the incident.

To avoid similar scenarios in the future, we have taken defensive actions at two separate levels:

  1. Operations: We have updated our runbooks with new operational procedures that guarantee data isolation during critical database operations.
  2. Application: We have introduced additional consistency checks in our codebase specifically designed to identify this rare scenario.

Each of these changes guarantees safe operations if similar conditions occur again. As part of our “fail safe” engineering principle, we have redundantly implemented more than one resolution.
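An application-level consistency check of the kind described above could look roughly like the following. This is a minimal sketch in Python, not Rollbar's actual code: the `Item`, `Project`, and `Recipient` types, their fields, and the `safe_to_notify` function are hypothetical names chosen for illustration. The idea is to confirm, immediately before sending, that the item, its project, and the recipient all belong to the same account, and to drop the notification otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    id: int
    account_id: int
    name: str


@dataclass(frozen=True)
class Item:
    id: int
    project_id: int
    title: str


@dataclass(frozen=True)
class Recipient:
    email: str
    account_id: int


def safe_to_notify(item: Item, project: Project, recipient: Recipient) -> bool:
    """Return True only if all three records agree on ownership."""
    # The item must actually belong to the project we loaded.
    if item.project_id != project.id:
        return False
    # The recipient must belong to the account that owns the project.
    if project.account_id != recipient.account_id:
        return False
    return True


project = Project(id=10, account_id=1, name="web")
item = Item(id=500, project_id=10, title="TypeError in checkout")
owner = Recipient(email="owner@example.com", account_id=1)
stranger = Recipient(email="other@example.com", account_id=2)

print(safe_to_notify(item, project, owner))     # True
print(safe_to_notify(item, project, stranger))  # False
```

Running the check at send time, rather than only when the notification is queued, means that records loaded from inconsistent stores are caught even if each lookup succeeded on its own.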

We are working with the data store vendor to investigate the cause of the original failure, which appears to be unrelated to the traffic volume or the specific data set at the time of the incident. We’ll publish updates on this topic soon.

Timeline

  • 1:04 AM PDT - Main data store begins experiencing performance degradation

    • Website becomes unavailable
    • Pipeline performance degrades
  • 1:22 AM PDT - Main data store crashes and becomes unavailable

    • Complete API outage, not accepting any incoming occurrences
    • Pipeline becomes unavailable
  • 1:40 AM PDT - Main data store is restarted and begins recovery

  • 1:45 AM PDT - API returns to baseline performance, but incoming occurrences are processed with a delay of up to 2 hours

  • 2:20 AM PDT - Website comes back online due to database failover

    • Pipeline starts processing events again
    • Notifications are being sent
  • 2:23 AM PDT - Main data store finishes recovery and becomes available again

  • 3:27 AM PDT - Pipeline performance returns to normal

Posted May 08, 2020 - 17:47 PDT

Resolved
All subsystems are completely operational and up-to-date.
Posted May 06, 2020 - 03:27 PDT
Update
We are continuing to monitor for any further issues.
Posted May 06, 2020 - 03:15 PDT
Update
The processing pipeline is still processing the backlog of events. All events are expected to be processed in one hour.
Posted May 06, 2020 - 03:01 PDT
Update
As we are processing the backlog of events, the processing pipeline has a latency of approximately 10 minutes. New item notifications and dashboards will show data outdated by 10 minutes.
Posted May 06, 2020 - 02:41 PDT
Monitoring
The event processing pipeline is operational. We are continuing to monitor for any further issues.
Posted May 06, 2020 - 02:33 PDT
Update
The Web application is fully operational. We are continuing to work on fixing the processing pipeline.
Posted May 06, 2020 - 02:23 PDT
Identified
The API Tier has recovered, and we are ingesting events again after a 10-minute outage. However, the web application and the pipeline are still not operational. We are continuing to work on resolving the problem.
Posted May 06, 2020 - 01:54 PDT
Investigating
We are experiencing a major outage. The API, the web application and the pipeline are not operational. We are investigating the issue.
Posted May 06, 2020 - 01:40 PDT
This incident affected: Web App (rollbar.com), API Tier (api.rollbar.com), and Processing Pipeline.