Summary of the Incident and Impact
On March 25th, 2024, between 10:58 and 12:08 PDT, Rollbar experienced increased platform latency affecting the Web Application (rollbar.com) and Pipeline services. The cause was a combination of two releases made in relatively quick succession.
One release moved the Summarization service to a new package management system. The other was a code release containing a poorly optimized query that increased load on our database.
At 07:34 PDT on March 25th, a release of the Summarization service was completed using the new package management system. The release changed an IP address that was used to configure the DNS entry for this service. As a result, requests to the service timed out, which increased page load latency on certain item views in the Web Application.
At 10:08 PDT, a release was deployed to the Web Application and Pipeline services. It included a code change that introduced a query that significantly increased disk I/O on one of Rollbar’s main databases. Pipeline latency started to build as load increased on the server, and this further affected page load times on the Web Application.
Alerts triggered at 10:21 PDT as thresholds were breached, bringing the problem to engineers' attention. Because the two issues were compounding to affect latency, it was not immediately clear what the problem was. The application remained usable but was significantly slower for some customers.
A series of reverts restored the system to stability.
Timeline:
Follow-Up Actions
To mitigate future risks and avoid similar incidents, we are undertaking the following actions: