Pipeline delay

Incident Report for Rollbar

Postmortem

At 5 PM PT on Aug 26, 2021, we began a pre-scheduled database procedure that involved creating a new MySQL cluster and reconfiguring our services to use it. We ran into problems during the procedure (described below), which caused a long delay in our item processing pipeline. Our API remained available throughout the incident, and no data was lost, with one small exception detailed below.

Around 6:30 PM PT, we realized that we had made a mistake when reconfiguring our services to use the new cluster. At that point, we disabled parts of our pipeline to protect data integrity, which meant we could no longer fully process items. Items submitted during this time were held in an intermediary queue.

We were able to start processing items again at 1:07 AM PT. By around 4 AM PT, we had finished processing the delayed items.
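
As a rough cross-check of the timeline above, the durations can be worked out from the times quoted in this report. The sketch below is illustrative only: the variable names are hypothetical, and the "around" times are treated as exact, so the results are approximate.

```python
from datetime import datetime, timedelta

# Times quoted in this report, in Pacific Time.
# "Around 6:30 PM" and "around 4 AM" are approximations, treated as exact here.
pipeline_stopped = datetime(2021, 8, 26, 18, 30)  # parts of the pipeline disabled
pipeline_resumed = datetime(2021, 8, 27, 1, 7)    # item processing restarted
backlog_cleared = datetime(2021, 8, 27, 4, 0)     # delayed items fully processed


def fmt(delta: timedelta) -> str:
    """Format a duration as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


stalled = pipeline_resumed - pipeline_stopped      # no items fully processed
catch_up = backlog_cleared - pipeline_resumed      # working through the queue
total_delay = backlog_cleared - pipeline_stopped   # stop to fully caught up

print("Pipeline stalled:", fmt(stalled))
print("Catch-up period: ", fmt(catch_up))
print("Total delay:     ", fmt(total_delay))
```

Under these assumptions, the pipeline was stalled for roughly six and a half hours, and the full delay from stop to catch-up was about nine and a half hours.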

The core issue, which took us from 6:30 PM PT to past 1 AM PT to resolve, was a configuration error in the new MySQL cluster we had created. This error caused the MySQL process on the leader to crash soon after startup. Eventually, we decided to switch to a pre-existent cluster known to work. Because the MySQL process on the new, non-working cluster was able to accept some data before crashing, we needed to transfer that data to the pre-existent, working cluster, which took over an hour. Once this data transfer was complete, we were able to start processing items again.

Earlier, we mentioned one exception to there being no data loss. The MySQL process on the non-working cluster was able to accept data but would crash soon after startup, so it may have crashed before it could commit all in-memory data to disk. As a result, any data held only in memory by MySQL at the time of a crash was not included when we transferred data to the pre-existent working cluster. Based on our MySQL configuration, the data potentially not flushed to disk when the process crashed amounts to no more than 10 seconds' worth of error data.

At the root of the incident was a configuration error in the new cluster. To remediate this going forward, we are changing two things.

First, we will be working more closely with Percona, a leading MySQL consultancy. They will audit and verify the configuration of our current and future MySQL instances. In general, starting this quarter, we are dedicating significantly more resources to MySQL operations.

Second, though we used a runbook and performed multiple test runs ahead of the main procedure, there was still room for human error during the main procedure. To mitigate this, we are changing our guidelines so that if an operator must take a step that isn’t in the runbook, that step will first be added to the runbook and evaluated by the rest of the team.

Due to Rollbar’s scale, operations like the one described above are difficult to fully de-risk, but that is no excuse. We deeply apologize for this extended delay in item processing. Thank you for making Rollbar a critical part of your stack.

Posted Sep 01, 2021 - 10:34 PDT

Resolved

Processing delay caught back up early this morning. We are still looking into some graphs that are not updating, but realtime alerts should be fully functional.

We will publish a postmortem on the incident early next week.

Posted Aug 27, 2021 - 11:04 PDT

Update

The processing pipeline and database are stable. We are continuing to monitor progress while processing catches back up.

Posted Aug 27, 2021 - 02:23 PDT

Monitoring

The failover to the new database has finished and we are now monitoring the processing pipeline while it catches up.

Posted Aug 27, 2021 - 01:07 PDT

Update

We are continuing to fail over to the new database and will post an update when we have more information.

Posted Aug 26, 2021 - 23:08 PDT

Update

We are failing over to a new database which will incur more processing latency. We don't expect any data loss and will update with progress shortly.

Posted Aug 26, 2021 - 21:58 PDT

Update

We are still in the process of starting the pipeline back up. We will continue posting updates as we complete this process.

Posted Aug 26, 2021 - 20:17 PDT

Update

We are accepting new data but our pipeline remains stopped. We are working to start the pipeline again.
The website remains available.

Posted Aug 26, 2021 - 19:05 PDT

Identified

We have stopped our item processing pipeline during our maintenance. We will send an update once we start the pipeline back up. Items will be processed with a delay.

Posted Aug 26, 2021 - 18:36 PDT

This incident affected: Processing pipeline (Core Processing Pipeline, iOS Symbolication processing pipeline).