Between Friday, March 9th at 11:25 AM and Saturday, March 10th at 12:30 PM (all times PST), we experienced three related outages that resulted in a total of roughly 35 minutes of web UI downtime and 15 minutes of API unavailability, with associated processing delays of up to 30 minutes. While attempting to recover from the first outage, we also introduced a configuration error that caused search and item list display issues for some customers from around 3:10 PM on Friday until 7:30 AM on Saturday.
We understand that you rely on Rollbar for real-time monitoring of your applications and systems, and we apologize for letting you down. We'd like to explain what happened and what actions we're taking to prevent it from happening again.
If you follow the Rollbar Status Page (http://status.rollbar.com/) you've no doubt noticed occasional “Unplanned DB Maintenance” incidents over the past few weeks. We've been having database performance issues that crop up when we experience a particular pattern of database activity, and the fastest way to recover has been to endure a 2-3 minute web UI outage and simply restart the affected database instance. These database restarts don't result in an API outage or downtime, but processing is delayed for a few minutes. We've been working with our database vendor to identify and remediate the underlying problem, and last Thursday we thought we'd found a solution. Our testing gave us hope we'd seen the last of the “unplanned maintenance” events, and Friday at 5:00 AM we deployed a configuration change.
Unfortunately, as is often the case, the behavior of our production database didn't match what we'd observed in our test environment, and over the next few hours we realized we hadn't fixed the problem. On Friday at 11:25 AM the system did something it hadn't done before: our master database instance simply froze, causing the web UI to go offline and pipeline processing to stall. We were unable to restart the instance and initiated failover to another instance in the cluster. The web UI recovered at 11:45 AM and pipeline processing resumed, but the pipeline didn't fully recover until 12:55 PM.
On Friday at around 3:10 PM, while working on recovery of the frozen database instance, we broke replication to a different instance that was powering search and the item list view for some customers. We didn't discover or correct this issue until 7:30 AM on Saturday.
On Saturday at 6:07 AM, while attempting to rebuild the database instance that had failed, one of our engineers inadvertently detached a disk from the wrong Google Compute Engine instance, causing the new master database to crash. This took the web UI offline once again. Because of the cluster configuration changes we'd made on Friday to accommodate the frozen DB instance, our API also went offline. We were back online by 6:15 AM, but we experienced processing delays until around 6:22 AM.
The third and final outage occurred on Saturday around 12:20 PM, when we performed a database restart necessary to bring the cluster back online in a fully operational state. Both the web UI and API were offline from 12:20 PM until 12:27 PM. The processing pipeline caught up and all systems were operating normally as of 12:30 PM on Saturday, March 10th.
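For readers who want to check the downtime totals against the three outage windows described above, here is a minimal Python sketch. It is illustrative only: the helper name `minutes` and the list names are made up for this example, the windows are copied from the timeline in this post, and it assumes each window starts and ends on the same day.

```python
from datetime import datetime

def minutes(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times such as '6:07 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

# Web UI outage windows taken from the timeline above (all times PST).
ui_outages = [
    ("11:25 AM", "11:45 AM"),  # Friday: master database froze
    ("6:07 AM", "6:15 AM"),    # Saturday: disk detached from wrong instance
    ("12:20 PM", "12:27 PM"),  # Saturday: final database restart
]

# The API was only offline during the two Saturday outages.
api_outages = [
    ("6:07 AM", "6:15 AM"),
    ("12:20 PM", "12:27 PM"),
]

ui_total = sum(minutes(s, e) for s, e in ui_outages)
api_total = sum(minutes(s, e) for s, e in api_outages)

print(f"Web UI downtime: {ui_total} minutes")
print(f"API downtime: {api_total} minutes")
```

The sums come to 35 minutes for the web UI and 15 minutes for the API, matching the figures given at the start of this post.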
Unfortunately, one result of this chain of events is that we reverted the initial configuration change we'd expected to solve the underlying performance issue, so for now it's possible you'll continue to see occasional “Unplanned DB Maintenance” events.
In each case where we had a UI or processing outage, we were actively monitoring the affected systems when the outage occurred, identified the cause within just a minute or two of the outage, and took immediate steps to resolve the issue. We hope we also provided what you felt were timely updates via status.rollbar.com. The systems we have in place for our engineers to collaborate were effective, and we had very little friction in terms of process and troubleshooting.
In the case of the search and item list issues we introduced on Friday afternoon and didn't resolve until Saturday morning, we're certainly not happy with our response. As can sometimes happen during incidents like this, we were overwhelmed with alerts all triggered by the same fundamental issue and missed the one that was directing us to this problem. While we were quick to correct the issue when it was brought to our attention by users, we should've seen the problem and corrected it earlier.
We know that you rely on Rollbar and we hate to let you down. If you have any questions, please don't hesitate to contact us at firstname.lastname@example.org.