Pipeline latency
Incident Report for Rollbar
Postmortem

Summary of the Incident and Impact

On March 25, 2024, between 10:58 and 12:08 PDT, Rollbar experienced a platform latency increase affecting the Web Application (rollbar.com) and Pipeline services. The cause can be traced to a combination of two releases that occurred in relatively quick succession.

One release moved the Summarization service to a new package management system. The other was a code release containing a poorly optimized query that increased load on our database.

At 07:34 PDT on March 25, a release of the Summarization service was completed using a new package management system. The release changed an IP address that was used to configure the DNS entry for this service. As a result, requests to the service timed out, which increased page load latency on certain item views in the Web Application.
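
As an illustration of this failure mode, the following is a minimal sketch of a post-deploy check that compares what a hostname resolves to against the addresses the newly deployed instances are actually using. It is not Rollbar's tooling: the function names, the hostname, and the idea that a deploy step can supply the list of live IPs are all assumptions made for the example.

```python
import socket


def resolve_ipv4(hostname: str) -> set[str]:
    """Return the set of IPv4 addresses the hostname currently resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return {info[4][0] for info in infos}


def find_stale_records(resolved: set[str], live: set[str]) -> set[str]:
    """Addresses that DNS still hands out but no live instance is using."""
    return resolved - live


def stale_addresses_for(hostname, live_ips, resolver=resolve_ipv4):
    """Hypothetical post-deploy gate: returns a sorted list of stale addresses.

    `live_ips` is assumed to come from the deploy tooling. `resolver` is
    injectable so the check can be exercised without real DNS lookups.
    """
    return sorted(find_stale_records(resolver(hostname), set(live_ips)))
```

A deploy step could call `stale_addresses_for("summarization.internal.example", live_ips)` (a made-up hostname) and fail the release if the result is non-empty, so that a mismatch surfaces at deploy time rather than as request timeouts.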

At 10:08 PDT, a release was deployed to the Web Application and Pipeline services. It contained a code change that introduced a query which significantly increased disk I/O on one of Rollbar’s main databases. Pipeline latency started to build as load increased on the server, and this further affected page load times on the Web Application.

Alerts fired at 10:21 PDT as thresholds were breached and brought the problem to engineers' attention, but because the two issues were compounding each other's effect on latency, the cause was not immediately clear. The application was still usable but significantly slow for some customers.

A series of reverts brought the system back to stability.

Timeline:

  • March 25 07:34 PDT - Summarization service was deployed using a new package manager
  • 10:08 PDT - Changes to Rollbar’s Web Application and Pipeline were released with a poorly optimized database query
  • 10:21 PDT - Rollbar's internal alerts started to fire as latency spiked in various places
  • 10:58 PDT - General stability of the Web Application and Pipeline is affected, with some customers reporting slow-loading or unreachable pages
  • 11:26 PDT - A revert of the changes to the Web Application and Pipeline was deployed
  • 12:08 PDT - The changes to the Summarization service were reverted and full stability was reached

Follow-Up Actions

To mitigate future risks and avoid similar incidents, we are undertaking the following actions:

  • We are actively working to improve how we reconcile IP addresses with our DNS for the Summarization service.
  • We will hold a full internal postmortem on this event by April 5, 2024, and expect to identify further action items to improve our systems.
Posted Apr 04, 2024 - 13:45 PDT

Resolved
This incident has been resolved.
Posted Mar 25, 2024 - 12:32 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 25, 2024 - 12:14 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 25, 2024 - 11:35 PDT
Update
We are continuing to investigate this issue.
Posted Mar 25, 2024 - 10:50 PDT
Investigating
We are currently experiencing an issue within the pipeline that might be causing some latency in processing. We are investigating this issue and will provide an update as more is understood.
Apologies for any inconvenience.
Posted Mar 25, 2024 - 10:50 PDT
This incident affected: Processing pipeline (Core Processing Pipeline, iOS Symbolication pipeline, Source map symbolication pipeline).