On Wednesday, June 19th, from 2:35 PM until 3:45 PM (all times PDT), our processing pipeline was offline. Our web UI was available and our API continued to receive data, but incoming data was not being processed. As a result, the Rollbar app generated no notifications and data was missing from the web UI. We resumed processing at 3:45 PM, prioritizing new data entering the system. The last of the cached data was processed by 6:10 PM.
Analysis and Next Steps
In the week leading up to the outage, we had been performing OS upgrades on many of our systems, including a Sensu instance responsible for monitoring storage space on several DB clusters. The upgrades introduced an error into our Sensu cluster configuration, and it stopped performing scheduled checks. We had run into this with Sensu before, so we have a second monitoring system in place to ensure Sensu is running checks as scheduled. Unfortunately, that second layer was sending updates to a Slack channel that had been renamed. Because of these misconfigurations in two layers of monitoring, we received no notification that one of our DB clusters was running low on storage space, and it ran out entirely at 2:35 PM, causing the pipeline to fail. We also received no immediate notification of the pipeline failure, and our operations team did not become aware of the processing delays until 3:45 PM. At that point we expanded storage on the DB cluster, resumed processing of new data, and began working through the backlog.
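One common safeguard against exactly this failure mode is a dead-man's-switch: the monitor records a heartbeat each time it completes a check cycle, and an independent watchdog alerts if that heartbeat grows stale, so silence itself becomes an alert. The sketch below is illustrative only; the function names, the in-memory state, and the 15-minute threshold are our assumptions here, not a description of Rollbar's or Sensu's actual setup.

```python
import time

# Assumed threshold: alert if no check cycle has completed in 15 minutes.
HEARTBEAT_MAX_AGE = 15 * 60  # seconds


def record_heartbeat(state, now=None):
    """Called by the monitoring system after each completed check cycle."""
    state["last_heartbeat"] = now if now is not None else time.time()


def heartbeat_is_stale(state, now=None):
    """Called by an independent watchdog; True means the monitor itself is down."""
    now = now if now is not None else time.time()
    last = state.get("last_heartbeat")
    # No heartbeat ever recorded, or last heartbeat older than the threshold.
    return last is None or (now - last) > HEARTBEAT_MAX_AGE


state = {}
record_heartbeat(state, now=1000.0)
assert not heartbeat_is_stale(state, now=1000.0 + 60)       # checks ran 1 minute ago
assert heartbeat_is_stale(state, now=1000.0 + 16 * 60)      # no checks for 16 minutes
```

The key property is that the watchdog fires when the monitor goes quiet for any reason, including a bad configuration push; it should also deliver its alert over a channel that is verified end to end, so a renamed Slack channel can't silently swallow it.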
As you can imagine, for a company in the monitoring industry it's embarrassing to have such a fundamental failure in how we monitor our own systems. We're accelerating plans to overhaul and improve many of those systems and procedures.
We know that you rely on Rollbar and we're sorry to let you down. If you have any questions, please don't hesitate to contact us at email@example.com.