On Wednesday, June 19th, from 2:35 PM until 3:45 PM (all times PDT), our processing pipeline was offline. Our web UI was available and our API continued to receive data, but incoming data was not being processed. As a result, the Rollbar app generated no notifications and data was missing from the web UI. We resumed processing at 3:45 PM, prioritizing new data entering the system. The last of the cached data was processed by 6:10 PM.
Analysis and Next Steps
In the week leading up to the outage, we had been performing OS upgrades on many of our systems, including a Sensu instance responsible for monitoring storage space on several DB clusters. The upgrades introduced an error into our Sensu cluster configuration, and it stopped performing scheduled checks. We had run into this with Sensu before, so we have a second monitoring system in place to ensure Sensu is running checks as scheduled. Unfortunately, that second layer was sending updates to a Slack channel that had been renamed. Because of these misconfigurations in two layers of monitoring, we received no notification that one of our DB clusters was running low on storage space, and it ran out entirely at 2:35 PM, causing the pipeline to fail. We also received no immediate notification of the pipeline failure, and our operations team did not become aware of the processing delays until 3:45 PM. At that point we expanded storage on the DB cluster, resumed processing of new data, and began working through the backlog.
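One common safeguard against exactly this failure mode is a dead-man's-switch: the monitor records a heartbeat each time it completes a check cycle, and an independent watchdog alerts if that heartbeat grows stale, so silence itself becomes an alert. The sketch below is illustrative only; the function names, the in-memory state, and the 15-minute threshold are our assumptions here, not a description of Rollbar's or Sensu's actual setup.

```python
import time

# Assumed threshold: alert if no check cycle has completed in 15 minutes.
HEARTBEAT_MAX_AGE = 15 * 60  # seconds


def record_heartbeat(state, now=None):
    """Called by the monitoring system after each completed check cycle."""
    state["last_heartbeat"] = now if now is not None else time.time()


def heartbeat_is_stale(state, now=None):
    """Called by an independent watchdog; True means the monitor itself is down."""
    now = now if now is not None else time.time()
    last = state.get("last_heartbeat")
    # No heartbeat ever recorded, or last heartbeat older than the threshold.
    return last is None or (now - last) > HEARTBEAT_MAX_AGE


state = {}
record_heartbeat(state, now=1000.0)
assert not heartbeat_is_stale(state, now=1000.0 + 60)       # checks ran 1 minute ago
assert heartbeat_is_stale(state, now=1000.0 + 16 * 60)      # no checks for 16 minutes
```

The key property is that the watchdog fires when the monitor goes quiet for any reason, including a bad configuration push; it should also deliver its alert over a channel that is verified end to end, so a renamed Slack channel can't silently swallow it.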
As you can imagine, for a company in the monitoring industry it's embarrassing to have such a fundamental failure in how we monitor our own systems. We're accelerating plans to overhaul and improve many of those systems and procedures.
We know that you rely on Rollbar and we're sorry to let you down. If you have any questions, please don't hesitate to contact us at email@example.com.