Processing Pipeline Outage
Incident Report for Rollbar
Postmortem

Incident Summary

On Wednesday, June 19th, from 2:35 PM until 3:45 PM (all times PDT), our processing pipeline was offline. Our web UI was available and our API was receiving data, but we weren't processing incoming data. This meant we weren't generating notifications from the Rollbar app and data was missing from the web UI. We resumed processing at 3:45 PM, giving priority to new data entering the system. The last of the cached data was processed by 6:10 PM.

Analysis and Next Steps

In the week leading up to the outage we'd been performing OS upgrades on many of our systems, one of which was a Sensu instance that's responsible for monitoring storage space on several DB clusters. Our upgrades introduced an error into our Sensu cluster configuration, and it stopped performing scheduled checks. This is something we'd run into with Sensu before, and we have another monitoring system in place to ensure Sensu is running checks as scheduled. Unfortunately, this second layer of monitoring was sending updates to a Slack channel that had been renamed, so those updates never reached us. Because of misconfigurations in these two layers of monitoring, we didn't receive notification that one of our DB clusters was running low on storage space, and it ran out of space entirely at 2:35 PM, causing the pipeline to fail. We also didn't receive immediate notification of the pipeline failure, and our operations team did not become aware of the processing delays until 3:45 PM. At that time we expanded storage space on the DB cluster, resumed processing of new data, and started to process the backlog of data.
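
The two safeguards described above can be sketched in a few lines of Python. This is an illustration only, not Rollbar's actual Sensu or Slack setup: the names CheckHeartbeat, find_stale_checks and notify_with_fallback, the check names, and the grace factor are all hypothetical. The first function flags scheduled checks that have stopped reporting, and the second tries several delivery channels in order so that a single dead destination (such as a renamed chat channel) does not silently swallow an alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple


@dataclass
class CheckHeartbeat:
    name: str            # hypothetical check name, e.g. "db-disk-space"
    last_run: datetime   # when the check last reported a result
    interval: timedelta  # how often the check is scheduled to run


def find_stale_checks(heartbeats: List[CheckHeartbeat],
                      now: datetime,
                      grace: float = 2.0) -> List[str]:
    """Return names of checks that have not run within grace * interval."""
    return [hb.name for hb in heartbeats
            if now - hb.last_run > hb.interval * grace]


def notify_with_fallback(message: str,
                         channels: List[Tuple[str, Callable[[str], bool]]]) -> str:
    """Try each (name, send) channel in order and return the name of the
    first one whose send() reports success. Raise if every channel fails,
    so that a delivery failure is never silent."""
    for name, send in channels:
        try:
            if send(message):
                return name
        except Exception:
            # Assumption: a failing channel should not stop the fallback chain.
            continue
    raise RuntimeError("no notification channel accepted the alert")
```

In use, the caller would run find_stale_checks on a schedule and pass any result to notify_with_fallback with at least two independent channels, for example a chat webhook followed by a paging service. The send callables are placeholders for whatever delivery mechanism is available; the key design choice is that each one reports success or failure rather than assuming delivery.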

As you can imagine, as a company in the monitoring industry it's embarrassing to have such a fundamental failure in how we monitor our own systems. We're accelerating plans to overhaul and improve many of those systems and procedures.

We know that you rely on Rollbar and we’re sorry to let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.

Posted 4 months ago. Jun 24, 2019 - 14:07 PDT

Resolved
The pipeline is fully recovered and the system is stable. We'll do a postmortem on this incident in the near future.
Posted 4 months ago. Jun 19, 2019 - 19:59 PDT
Update
The pipeline pause is complete and data is being processed. We expect to be back to real-time within a few minutes.
Posted 4 months ago. Jun 19, 2019 - 19:53 PDT
Update
We're starting a ~15-minute pause in the processing pipeline now. Once the pause is complete, data received during the pause will be processed.
Posted 4 months ago. Jun 19, 2019 - 19:42 PDT
Update
We're beginning the maintenance; in approx 10 minutes, the processing pipeline will pause for approximately 15 minutes. Once the pause is complete, data received during the pause will be processed.
Posted 4 months ago. Jun 19, 2019 - 19:26 PDT
Update
We'll begin the maintenance in approximately 30 minutes. The maintenance will result in a ~15-minute pause in the processing pipeline. We'll post another update when the pause is about to begin.
Posted 4 months ago. Jun 19, 2019 - 19:07 PDT
Update
We've completed processing the backlog of data collected during the pipeline outage.

The health of one of our DB clusters is currently degraded such that we'll need to put a new cluster into service later this evening. This will require another brief pause of the processing pipeline. We'll keep you updated as we prepare for that cut-over.
Posted 4 months ago. Jun 19, 2019 - 18:17 PDT
Monitoring
During the outage the pipeline fell approximately an hour behind. We're once again processing NEW data coming into the pipeline in near-real time, and we're simultaneously processing the backlog of data we collected during the pipeline outage.
Posted 4 months ago. Jun 19, 2019 - 17:13 PDT
Identified
We've identified and resolved the issue that caused the pipeline failure. The system is recovering. We'll post an ETA for recovery and processing-delay data as we have them. We had an extended outage because of a failure in our monitoring system and we'll be performing a complete post-mortem as soon as possible.
Posted 4 months ago. Jun 19, 2019 - 16:07 PDT
Investigating
We're receiving data via the API, but the processing pipeline is down. We're investigating.
Posted 4 months ago. Jun 19, 2019 - 15:44 PDT
This incident affected: Processing Pipeline.