Rollbar values transparency and understands that our growth is dependent on our ability to honestly reflect on where we can do better.
Between December 16th and December 18th, we experienced a system-wide degradation. As a result, error occurrences sent to us during that period, a span of roughly 48 hours, were processed, but with significant delays.
The incident had knock-on effects that impacted our ability to deliver the best real-time experience. For the duration of the incident, the majority of errors sent to us still appeared in the web app in real time, but notifications were significantly delayed and some aspects of the UI, such as occurrence counts, were affected.
Around 11:30 AM PT on December 16th, an unexpected interaction between our data retention deletion system and the rest of our pipeline caused one partition of a job queue to begin building up.
As soon as we detected the issue, we began our incident response process. Within 24 hours, on December 17th, we isolated the issue and released a fix.
By then a large backlog of jobs had built up in the queue. Working through it placed major strain on the rest of our system, because we were also receiving historically high traffic at the same time. As a result, we did not finish processing the entire backlog until 11 AM PT on December 18th.
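To illustrate why a backlog clears slowly when incoming traffic is high, here is a minimal sketch. The function name and all numbers are hypothetical and chosen only for illustration; they are not Rollbar's actual figures or implementation. The idea is that only the capacity left over after handling new arrivals is available to drain the backlog.

```python
def drain_hours(backlog_jobs: float, capacity_per_hour: float, arrival_per_hour: float) -> float:
    """Estimate hours needed to clear a backlog while new jobs keep arriving.

    All inputs are hypothetical; this is a simple steady-rate model.
    """
    # Only the capacity not consumed by new arrivals can go toward the backlog.
    spare_capacity = capacity_per_hour - arrival_per_hour
    if spare_capacity <= 0:
        # Arrivals meet or exceed capacity, so the backlog never shrinks.
        return float("inf")
    return backlog_jobs / spare_capacity


# Hypothetical: the same backlog and capacity under normal vs. peak traffic.
normal_traffic = drain_hours(1_000_000, 100_000, 50_000)
peak_traffic = drain_hours(1_000_000, 100_000, 90_000)
print(normal_traffic, peak_traffic)
```

In this simplified model, raising the arrival rate from half of capacity to 90% of capacity makes the same backlog take five times as long to clear.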
We will review and update our incident response process with the lessons from this event no later than January 31, 2022, and we will fold those lessons into our upcoming infrastructure scaling initiatives.
Thank you for being a Rollbar customer, and happy holidays!