At 5 PM PT on Aug 26, 2021, we performed a pre-scheduled database procedure that involved creating a new MySQL cluster and re-configuring our services to use it. Unfortunately, we ran into issues (described below) that caused a long delay in our item processing pipeline. However, our API remained available throughout the incident, and there was no data loss, with one exception detailed below.
Around 6:30 PM PT, we realized that we had made a mistake when re-configuring our services to use the new cluster. At that point, we disabled parts of our pipeline to protect data integrity, which meant we could no longer fully process items. Submitted items were instead held in an intermediate queue.
We were able to start processing items again at 1:07 AM PT. By around 4 AM PT, we had finished processing the delayed items.
The core issue, which took us from 6:30 PM PT until past 1 AM PT to resolve, was a configuration error in the new MySQL cluster we had created. The error caused the MySQL process on the leader to crash soon after startup. Eventually, we decided to fall back to a pre-existing cluster known to work. Because the MySQL process on the new, non-working cluster had accepted some data, we needed to transfer that data to the pre-existing, working cluster, which took over an hour. Once this data transfer was complete, we were able to start processing items again.
Earlier, we mentioned that there was a small amount of data loss. The MySQL process on the non-working cluster was able to accept data, but because it would crash soon after startup, it is possible that it crashed before committing all of its in-memory data to disk. When we transferred data to the pre-existing, working cluster, any data held only in MySQL's memory was therefore not transferred. Based on our MySQL configuration, the data potentially not flushed to disk when the process crashed amounts to no more than 10 seconds' worth of error data.
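The reason the potential loss is bounded is that MySQL's flush settings control how often committed data is synced from memory to disk, so a crash can only lose data written since the last sync. As a rough illustration only, the sketch below (which assumes PyMySQL and hypothetical connection details, and is not a description of our actual tooling or configuration) shows how to inspect the kinds of server variables that govern that flush interval:

```python
# Hypothetical sketch: inspect the MySQL/InnoDB flush settings that bound how much
# committed-but-unflushed data can be lost if the server process crashes.
# The host, credentials, and use of PyMySQL are illustrative assumptions.
import pymysql

FLUSH_VARIABLES = (
    "innodb_flush_log_at_trx_commit",  # when the redo log is flushed/synced to disk
    "innodb_flush_log_at_timeout",     # max seconds between redo log flushes
    "sync_binlog",                     # how often the binary log is synced to disk
)

conn = pymysql.connect(host="mysql.example.internal", user="ops", password="...")
try:
    with conn.cursor() as cursor:
        for name in FLUSH_VARIABLES:
            cursor.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
            row = cursor.fetchone()
            if row:
                print(f"{row[0]} = {row[1]}")
finally:
    conn.close()
```

Settings like these trade durability for write throughput; the window they allow between flushes is what limits a crash's potential loss to a handful of seconds of data.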
At the root of the incident was a configuration error in the new cluster. To remediate this going forward, we are changing two things.
Due to Rollbar’s scale, operations like the one described above are difficult to fully de-risk, but that is no excuse. We deeply apologize for this extended delay in item processing. Thank you for making Rollbar a critical part of your stack.