Web and Processing Pipeline outage

Incident Report for Rollbar

Postmortem

Overview

Rollbar experienced a major outage starting at 11:40 AM PDT on May 31st, caused by an interruption of service in one of our main datastores. The total outage lasted until 12:11 PM, followed by a period of degraded service until 12:28 PM.

During this timeframe, the API was unavailable from 11:40 AM to 12:04 PM, the website was inaccessible from 11:40 AM to 12:11 PM, and pipeline processing was delayed until 12:28 PM PDT.

What Happened

At 11:40 AM PDT, our cloud provider performed a routine hardware migration of one of our main datastores. These migrations normally have no effect, but in this case the migration experienced a fault, and our database instance was forced to reboot and enter crash recovery. Because of the size of the database, crash recovery took about 20 minutes.

Impact

The API outage lasted 24 minutes, during which Rollbar was unable to ingest new data, resulting in data loss for that period. There was no corruption of any tables containing customer data. The website was inaccessible for 31 minutes. The processing pipeline was delayed during the incident and ran with degraded service for a total of 48 minutes.

Resolution

The engineering team worked to simultaneously restore functionality to the API, website, and pipeline while monitoring the status of the database crash recovery. Our previous efforts meant that the database recovered quickly. However, we understand that even temporary instability on our part disrupts our customers.

To help prevent future incidents, we are reviewing our change management procedures. We are also dedicating significant resources to increasing the robustness of our datastores.

Timeline

[May 31 11:40 AM PDT] Main datastore instance reboots and goes into crash recovery
- Website becomes unavailable
- Pipeline halts
- API goes down
[12:04 PM] Main datastore finishes crash recovery
- API accessible again
- Pipeline resumes processing
[12:11 PM] Website becomes accessible
[12:28 PM] Pipeline catches up
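
The durations quoted in this report follow directly from the timeline above. As a quick check, here is a minimal Python sketch that recomputes them from the published timestamps. The helper names (`at`, `minutes_affected`) and the data layout are illustrative assumptions, not part of any Rollbar tooling.

```python
from datetime import datetime

# All timestamps are from the timeline above (May 31, 2020, PDT).
DAY = "2020-05-31 "
FMT = "%Y-%m-%d %I:%M %p"


def at(clock: str) -> datetime:
    """Parse a 12-hour clock time such as '11:40 AM' on the incident day."""
    return datetime.strptime(DAY + clock, FMT)


# The datastore rebooted and every component went down at the same moment.
outage_start = at("11:40 AM")

# Time at which each component recovered, per the timeline.
recovered_at = {
    "API": at("12:04 PM"),
    "Website": at("12:11 PM"),
    "Pipeline": at("12:28 PM"),
}


def minutes_affected(recovered: datetime) -> int:
    """Whole minutes between the start of the outage and recovery."""
    return int((recovered - outage_start).total_seconds() // 60)


durations = {name: minutes_affected(t) for name, t in recovered_at.items()}

for name, minutes in durations.items():
    print(f"{name}: {minutes} minutes")
```

The results match the figures in the Impact section: 24 minutes for the API, 31 minutes for the website, and 48 minutes for the pipeline.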

Posted Jun 08, 2020 - 20:18 PDT

Resolved

The Web App is operational and the Processing Pipeline is processing events in real time.

Posted May 31, 2020 - 12:34 PDT

Update

Currently both the Web App and the Processing Pipeline are operating, but with degraded performance. Real time notifications are delayed by at most 30 minutes.

Posted May 31, 2020 - 12:17 PDT

Investigating

We are experiencing an outage, impacting the web tier and the processing pipeline. We are investigating the issue.

Posted May 31, 2020 - 11:55 PDT

This incident affected: Web App (rollbar.com) and Processing pipeline (Core Processing Pipeline).