Rollbar experienced a major outage caused by a service interruption in one of our main datastores, beginning at 2:40 PM PDT on May 21st, with a total outage lasting until 5:50 PM and degraded service until 8:29 PM.
During this window, the API was unavailable from 2:40 PM to 3:25 PM, the website was inaccessible from 2:40 PM to 5:50 PM, and pipeline processing was delayed until 6:25 PM PDT.
During recovery procedures after last week’s outage, we discovered that one of our central databases contained corrupted data. The corruption was limited to a table left over from prior maintenance and not in use by the product. During the maintenance work to remove the corrupted data, the primary and secondary databases both crashed due to a known bug in MySQL.
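For illustration only, here is a minimal sketch of what verifying a suspect table before removing it might look like; the table name, connection details, and client library (mysql-connector-python) are hypothetical and do not reflect our actual maintenance tooling:

```python
# Hypothetical sketch: check a leftover table for corruption before dropping it.
# Host, credentials, and table name are placeholders, not Rollbar's real values.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.internal",
    user="maintenance",
    password="...",
    database="core",
)
cursor = conn.cursor()

# CHECK TABLE returns rows of (Table, Op, Msg_type, Msg_text);
# corruption shows up in the Msg_text column.
cursor.execute("CHECK TABLE legacy_migration_backup")
for table, op, msg_type, msg_text in cursor.fetchall():
    print(f"{table}: {op} -> {msg_type}: {msg_text}")

# Drop the table only after confirming it is unused and backed up;
# on some MySQL versions, operating on a corrupted table can crash the server.
cursor.execute("DROP TABLE IF EXISTS legacy_migration_backup")
cursor.close()
conn.close()
```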
The 45-minute API outage resulted in data loss, as Rollbar was unable to ingest new data during that period. While we were able to restore read functionality at 3:25 PM, write access to the database was not restored until 5:50 PM, during which time the processing pipeline and web UI were both down. Once write access was restored, the core processing pipeline caught up within 35 minutes; notifications and realtime web UI updates took roughly two more hours.
There was no corruption to any tables containing customer data.
The engineering team worked to restore service to the API while simultaneously returning the database to a consistent, stable state where write operations could safely resume. A fix was identified at 4:15 PM; however, due to the size and complexity of the database, it took another hour and a half to fully implement.
Given the circumstances, our engineers were able to restore partial availability relatively quickly. However, we understand that Rollbar's availability and stability are part of the core value our service provides and that our customers depend on. As such, we will be rolling out increased redundancy and automation for our primary datastores over the coming weeks. Additionally, we are making changes to our change management procedures to ensure a situation like this doesn't happen again.
We deeply apologize for the disruption this outage has caused, and we are dedicated to improving our service quality and stability to meet the expectations of our customers.
2:40 PM PDT - Main datastore becomes inaccessible
3:25 PM PDT - Read functionality is restored to the main datastore
5:50 PM PDT - Write functionality is restored to the main datastore and all processes are resumed successfully
6:25 PM PDT - Realtime data is available via the web UI
8:29 PM PDT - Full functionality is restored