Rollbar experienced a major outage caused by a service interruption in one of our main datastores, beginning at 2:40 PM PDT on May 21st, with a total outage lasting until 5:50 PM and degraded service until 8:29 PM.
During this window, the API was unavailable from 2:40 PM to 3:25 PM, the website was inaccessible from 2:40 PM to 5:50 PM, and pipeline processing was delayed until 6:25 PM PDT.
During recovery procedures after last week’s outage, we discovered that one of our central databases contained corrupted data. The corruption was limited to a table left over from prior maintenance and not in use by the product. During the maintenance work to remove the corrupted data, the primary and secondary databases both crashed due to a known bug in MySQL.
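For illustration only, here is a minimal sketch of what verifying a suspect table before removing it might look like; the table name, connection details, and client library (mysql-connector-python) are hypothetical and do not reflect our actual maintenance tooling:

```python
# Hypothetical sketch: check a leftover table for corruption before dropping it.
# Host, credentials, and table name are placeholders, not Rollbar's real values.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.internal",
    user="maintenance",
    password="...",
    database="core",
)
cursor = conn.cursor()

# CHECK TABLE returns rows of (Table, Op, Msg_type, Msg_text);
# corruption shows up in the Msg_text column.
cursor.execute("CHECK TABLE legacy_migration_backup")
for table, op, msg_type, msg_text in cursor.fetchall():
    print(f"{table}: {op} -> {msg_type}: {msg_text}")

# Drop the table only after confirming it is unused and backed up;
# on some MySQL versions, operating on a corrupted table can crash the server.
cursor.execute("DROP TABLE IF EXISTS legacy_migration_backup")
cursor.close()
conn.close()
```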
The 45-minute API outage resulted in data loss, as Rollbar was unable to ingest new data during that period. While we were able to restore read functionality at 3:25 PM, write access to the database was not restored until 5:50 PM, during which time the processing pipeline and web UI were both down. Once write access was restored, the core processing pipeline caught up within 35 minutes; notifications and realtime web UI updates took roughly two more hours.
There was no corruption to any tables containing customer data.
The engineering team worked to restore service to the API while simultaneously returning the database to a consistent, stable state where write operations could safely resume. A fix was identified at 4:15 PM; however, due to the size and complexity of the database, it took another hour and a half to fully implement.
Given the circumstances, our engineers were able to restore partial availability relatively quickly. However, we understand that Rollbar's availability and stability are part of the core value our service provides and that our customers depend on. As such, we will be rolling out increased redundancy and automation for our primary datastores over the coming weeks. Additionally, we are making changes to our change management procedures to ensure a situation like this doesn't happen again.
We deeply apologize for the disruption this outage has caused, and we are dedicated to improving our service quality and stability to meet the expectations of our customers.
2:40 PM PDT - Main datastore becomes inaccessible
3:25 PM PDT - Read functionality is restored to the main datastore
5:50 PM PDT - Write functionality is restored to the main datastore and all processes are resumed successfully
6:25 PM PDT - Realtime data is available via the web UI
8:29 PM PDT - Full functionality is restored