Unscheduled DB Maintenance

Incident Report for Rollbar

Postmortem

Between Friday, March 9th at 11:25 AM (all times are PST) and Saturday, March 10th at 12:30 PM, we experienced three related outages that resulted in a total of roughly 35 minutes of web UI downtime and 15 minutes of API unavailability, with associated processing delays of up to 30 minutes. While attempting to recover from the first outage, we also introduced a configuration error that caused search and item list display issues for some customers from around 3:10 PM on Friday until 7:30 AM on Saturday.

We understand that you rely on Rollbar for real-time monitoring of your applications and systems, and we apologize for letting you down. We'd like to explain what happened and what actions we're taking to prevent it from happening again.

Background

If you follow the Rollbar Status Page (http://status.rollbar.com/) you've no doubt noticed occasional “Unplanned DB Maintenance” incidents over the past few weeks. We've been having database performance issues that crop up when we experience a particular pattern of database activity, and the fastest way to recover has been to endure a 2-3 minute web UI outage and simply restart the affected database instance. These database restarts don't result in an API outage or downtime, but processing is delayed for a few minutes. We've been working with our database vendor to identify and remediate the underlying problem, and last Thursday we thought we'd found a solution. Our testing gave us hope we'd seen the last of the “unplanned maintenance” events, and Friday at 5:00 AM we deployed a configuration change.

Outage Timeline, Cause, and Resolution

Unfortunately, as is often the case, the behavior of our production database didn't match what we'd observed in our test environment, and over the next few hours we realized we hadn't fixed the problem. On Friday at 11:25 AM the system did something it hadn't done before: our master database instance simply froze, causing the web UI to go offline and pipeline processing to stall. We were unable to restart the instance and initiated failover to another instance in the cluster. The web UI recovered at 11:45 AM and pipeline processing resumed; however, the pipeline didn't fully recover until 12:55 PM.

On Friday at around 3:10 PM while working on recovery of the frozen database instance, we broke replication to a different instance that was powering search and the item list view for some customers. We didn't discover or correct this issue until 7:30 AM on Saturday.

On Saturday at 6:07 AM, while attempting to rebuild the database instance that had failed, one of our engineers inadvertently detached a disk from the wrong Google Compute Engine instance, causing the new master database to crash. This took the web UI offline once again. Because of the cluster configuration changes we'd made on Friday to accommodate the frozen DB instance, our API also went offline. We were back online by 6:15 AM; however, we experienced processing delays until around 6:22 AM.

The third and final outage occurred on Saturday around 12:20 PM when we performed a database restart necessary to bring the cluster back online in a fully operational state. Both the web UI and API were offline from 12:20 PM until 12:27 PM. The processing pipeline caught up and all systems were operating normally as of 12:30 PM on Saturday, March 10th.
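
As a quick cross-check, the three outage windows in this timeline can be summed to reproduce the downtime totals quoted in the summary. This is only an illustrative sketch: the windows are copied from the timeline above (all times PST), and the names `OUTAGES` and `minutes` are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# (start, end, web UI offline?, API offline?) as described in the timeline.
OUTAGES = [
    ("2018-03-09 11:25", "2018-03-09 11:45", True, False),  # master DB froze
    ("2018-03-10 06:07", "2018-03-10 06:15", True, True),   # disk detached
    ("2018-03-10 12:20", "2018-03-10 12:27", True, True),   # final restart
]

def minutes(start, end):
    """Length of one outage window in whole minutes."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

ui_minutes = sum(minutes(s, e) for s, e, ui, _ in OUTAGES if ui)
api_minutes = sum(minutes(s, e) for s, e, _, api in OUTAGES if api)
print(ui_minutes, api_minutes)  # 35 15
```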

Unfortunately one result of this chain of events is that we reverted the initial configuration change we'd expected to solve the underlying performance issue, so for now it's possible you'll continue to see occasional “Unplanned DB Maintenance” events.

Evaluating our Response

In each case where we had a UI or processing outage, we were actively monitoring the affected systems when the outage occurred, identified the cause within just a minute or two of the outage, and took immediate steps to resolve the issue. We hope we also provided what you felt were timely updates via status.rollbar.com. The systems we have in place for our engineers to collaborate were effective, and we had very little friction in terms of process and troubleshooting.

In the case of the search and item list issues we introduced on Friday afternoon and didn't resolve until Saturday morning, we're certainly not happy with our response. As can sometimes happen during incidents like this, we were overwhelmed with alerts all triggered by the same fundamental issue and missed the one that was directing us to this problem. While we were quick to correct the issue when it was brought to our attention by users, we should've seen the problem and corrected it earlier.

Improvements

We're continuing to work with our database vendor to address the underlying performance issue that triggered this chain of events.
We'll be making significant changes to our database cluster implementation to improve availability and make failover a less intrusive operation.
While we've made significant strides in making our API tier resilient to the type of datastore and component outages we experience internally, there's certainly more room for improvement.
We'll be reviewing our change management processes and procedures, particularly as they relate to our production environment. We need to be better at protecting engineers from making potentially catastrophic configuration changes on their own, without help from and review by a second set of eyes.
While we're constantly trying to improve our monitoring and alerting systems, it's time for another full audit. We need to ensure that checks are meaningful, needed checks are present, and alerts are functioning correctly. One area that needs particular attention is the extent to which we rely on “component” level alerts vs. “system” level alerts and alerts on key performance indicators. It's easy to miss things when the signal-to-noise ratio gets out of hand during an incident and you see a wave of redundant alerts all triggered by the same underlying problem.
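
To illustrate the signal-to-noise point in the last item, here is a minimal sketch of collapsing a wave of component-level alerts into one entry per underlying problem, so that an unrelated alert is not buried. It is not a description of our alerting stack: the `Alert` type, the `root_cause` label, and all check names are hypothetical, and it assumes each alert can already be tagged with a suspected cause.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Alert:
    check: str       # component-level check that fired (hypothetical name)
    root_cause: str  # assumed label for the suspected underlying problem

def group_by_root_cause(alerts):
    """Collapse alerts into one list of checks per underlying problem."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert.root_cause].append(alert.check)
    return dict(groups)

alerts = [
    Alert("web-ui-health", "db-master-frozen"),
    Alert("pipeline-lag", "db-master-frozen"),
    Alert("db-connections", "db-master-frozen"),
    Alert("search-replica-lag", "replication-broken"),
]

summary = group_by_root_cause(alerts)
for cause, checks in summary.items():
    print(f"{cause}: {len(checks)} alert(s): {', '.join(checks)}")
```

Grouped this way, the single replication alert appears as its own line rather than as one more entry in a stream dominated by the frozen database.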

In Conclusion

We know that you rely on Rollbar and we hate to let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.

Posted Jul 25, 2018 - 18:26 PDT

Resolved

The processing pipeline has caught up and all systems are functioning normally. We'll continue to monitor the system throughout the day.

We'll conduct a postmortem about this event and plan to post it early next week.

Posted Mar 09, 2018 - 12:55 PST

Update

We're expecting full recovery in about 15 minutes.

Posted Mar 09, 2018 - 12:44 PST

Monitoring

The web app is back online and the processing pipeline is catching up.

Posted Mar 09, 2018 - 11:55 PST

Update

The maintenance is taking longer than expected; we're working to get the web app and processing pipeline back online asap.

Posted Mar 09, 2018 - 11:37 PST

Investigating

We're performing some brief unplanned DB maintenance that will cause the web UI to be unresponsive for ~2-3 minutes with a similar processing pipeline delay.

Posted Mar 09, 2018 - 11:26 PST