Issues detecting item reactivations
Incident Report for Rollbar
Postmortem

Rollbar values transparency and understands that our growth is dependent on our ability to honestly reflect on where we can do better.

Between December 16th and December 18th, we experienced a system-wide degradation. As a result, error occurrences sent to us during that 48-hour span were processed, but with significant delays.

The incident resulted in knock-on effects that impacted our ability to deliver the best real-time experience. For the duration of the incident, even though the majority of errors sent to us appeared in the web app in real time, notifications were significantly delayed and some aspects of the UI were affected (such as occurrence counts).

Around 11:30 AM PT on December 16th, an unexpected interaction between our data retention deletion system and the rest of our pipeline caused one partition of a job queue to begin building up.

As soon as we detected the issue, we began our incident response process. Within 24 hours, on December 17th, we isolated the issue and released a fix.

The large queue of jobs placed major strain on the rest of our system, since we received historically high traffic on top of the backlog. As a result, we did not finish processing the entire backlog until 11 AM PT on December 18th.

We will review and update our incident response process with the learnings from this event no later than Jan 31, 2022, and we will fold those learnings into our upcoming infrastructure scaling initiatives.

Thank you for being a Rollbar customer, and happy holidays!

Posted Dec 24, 2021 - 07:31 PST

Resolved
This incident has been resolved.
Posted Dec 18, 2021 - 15:41 PST
Update
We have finished processing the pipeline. All operations are back to normal.

Thank you for your patience during this incident. Please expect a postmortem during the next week.
Posted Dec 18, 2021 - 15:41 PST
Update
We expect the entire backlog of events to be processed within the next 13 hours (by 6 PM Pacific). At that time, we will consider this incident resolved.
Posted Dec 18, 2021 - 05:31 PST
Update
We are seeing that some customers are not getting alerted about errors in real time. While we investigate the issue, please know that you can still see errors appear in real time on the web app.
Posted Dec 17, 2021 - 15:10 PST
Update
Our maintenance is complete. The pipeline is once again processing new errors sent to our API in real time.

Regarding the processing of our backlog from the past two days – we estimate that it could take a few more days to process the entire backlog. What this means is that while new errors sent to us are processed in real time, some of the notifications you will receive over the next few days will be notifications about errors you sent us up to two days ago.

Additionally, we would like to note that real-time update functionality on the web app is affected. This does not mean new items are unavailable. Instead, you may have to refresh the page manually to see them.

We apologize for the continued effects on your workflows.

This update replaces the planned update at 3 PM PT. You can expect another update tomorrow.
Posted Dec 17, 2021 - 14:40 PST
Update
Due to additional required maintenance, there is still a delay in the processing pipeline, but work is ongoing to resolve it with minimal interruptions. A further update will be posted at 3 PM PT.
Posted Dec 17, 2021 - 12:38 PST
Update
The backlog of items is still being processed. A further update to the status will be posted at midday PT.
Posted Dec 17, 2021 - 09:49 PST
Update
Processing the entire backlog of items will take up to 12 more hours. During this time, pipeline performance will be somewhat affected, though most items continue to be processed in real time.

Please expect a further update no later than 5 AM Pacific. We will of course update sooner if there is pertinent new information that we need to communicate with you. Thank you.
Posted Dec 16, 2021 - 16:22 PST
Monitoring
We have resolved the recent issue with our pipeline. There was no data loss and our API layer was not affected.
Posted Dec 16, 2021 - 14:34 PST
Update
We are continuing to investigate this issue.
Posted Dec 16, 2021 - 14:22 PST
Investigating
Our pipeline has hit an issue while processing the backlog of items. We are investigating.
Posted Dec 16, 2021 - 14:05 PST
Update
We are now processing item reactivations and updating certain metadata like occurrences per item.

It will take some hours to process the entire backlog of items.

Instead of posting an update at 1 PM PT, we will post another update at 3 PM PT.
Posted Dec 16, 2021 - 12:44 PST
Monitoring
We have identified the issue and have implemented a fix. We are monitoring the results. Expect another update at 1 PM PT.
Posted Dec 16, 2021 - 12:30 PST
Update
We are now validating our fix and will be releasing it in the next hour. We will post another update at 1 PM PT.
Posted Dec 16, 2021 - 12:02 PST
Update
We are continuing to work to fix the issue. We will provide another update by 12:00 PM Pacific.
Posted Dec 16, 2021 - 11:29 PST
Update
We are continuing to investigate the issue. We are experiencing an issue with Kafka caused by yesterday's load, and we are preparing a fix.
Posted Dec 16, 2021 - 08:33 PST
Investigating
We are experiencing an outage in one system component responsible for detecting item reactivations. At the moment, we are unable to notify you about item reactivations.

Additionally, some secondary operations like updating the total number of occurrences per item are affected.

We are actively investigating the issue and will post an update as soon as we have further information. Thank you.
Posted Dec 16, 2021 - 07:15 PST
This incident affected: Web App (rollbar.com) and Processing pipeline (Core Processing Pipeline).