Rollbar experienced a major outage starting at 11:40 AM PDT on May 31st, caused by a service interruption in one of our main datastores. The total outage lasted until 12:11 PM, followed by degraded service until 12:28 PM.
During this window, the API was unavailable from 11:40 AM to 12:04 PM, the website was inaccessible from 11:40 AM to 12:11 PM, and the processing pipeline was delayed until 12:28 PM PDT.
At 11:40 AM PDT, our cloud provider performed a routine hardware migration of one of our main datastores. These migrations normally have no customer-visible effect, but this one experienced a fault, and our database instance was forced to reboot and enter crash recovery. Because of the size of the database, crash recovery took about 20 minutes.
The API outage lasted 24 minutes, during which Rollbar was unable to ingest new data, resulting in data loss for that window. No tables containing customer data were corrupted. The website was inaccessible for 31 minutes, and the processing pipeline was delayed, with degraded service for a total of 48 minutes.
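The data loss follows directly from how an ingestion outage interacts with client behavior: unless a client buffers and retries, anything sent while the API is down is dropped. A minimal sketch of client-side retry with exponential backoff (the `send` callable and parameter names here are hypothetical illustrations, not Rollbar's actual SDK API):

```python
import time

def send_with_retry(send, payload, max_attempts=5, base_delay=1.0):
    """Try to deliver a payload, backing off exponentially on failure.

    `send` is any callable that raises on failure (hypothetical; not
    Rollbar's SDK). Returns True on success, False once all attempts
    are exhausted -- the point at which the payload is lost.
    """
    for attempt in range(max_attempts):
        try:
            send(payload)
            return True
        except Exception:
            if attempt < max_attempts - 1:
                # Wait 1s, 2s, 4s, ... before the next attempt.
                time.sleep(base_delay * (2 ** attempt))
    return False
```

With an outage longer than the total backoff window, even a retrying client eventually gives up, which is why a 24-minute API interruption translates into lost data.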
The engineering team worked to restore the API, website, and pipeline in parallel while monitoring the database's crash recovery. Our previous efforts meant the database recovered quickly. Still, we understand that even temporary instability on our part disrupts our customers.
To help prevent future incidents, we are reviewing our change management procedures and dedicating significant resources to increasing the robustness of our datastores.
[May 31 11:40 AM PDT] Main datastore instance reboots and goes into crash recovery
[12:04 PM] Main datastore finishes crash recovery
[12:11 PM] Website becomes accessible
[12:28 PM] Pipeline catches up