[Scheduled] DB Maintenance

Scheduled Maintenance Report for Rollbar

Postmortem

Incident Summary
On Friday, October 11th, from approximately 4:30 PM until 8:30 PM (all times Pacific Time), our data processing pipeline lagged behind by up to 3 hours. As a result, data sent to the Rollbar API was not immediately accessible in rollbar.com.

Analysis and Next Steps
At approximately 4:30 PM we began routine maintenance on one of our MySQL clusters. During this maintenance we missed applying an updated configuration to some of the servers that handle data processing. Around 5:20 PM we realized our mistake and began correcting the configuration on those servers. However, updating certain portions of our pipeline created downstream issues that left our system overloaded and slow. This slowness caused our usual deployment strategies to fail, which forced us to deploy to each server one by one. By 6:30 PM we had corrected the configuration on enough servers that we could update the rest with our usual deployment methods, and we began to see the system normalize. By 8:30 PM our pipeline had completely caught up and data was no longer lagging behind.

Fortunately, this issue was localized to our processing pipeline, and our ingestion system was not affected. We are confident that we did not lose any incoming data during the incident and that we have since processed the full backlog of data sent via the Rollbar API.

What we're doing next
In the near term, we plan to automate more of this maintenance to reduce the possibility of human error. Longer term, we're working towards a new architecture for this datastore that eliminates the need for this kind of maintenance.

We know that you rely on Rollbar and we’re sorry to have let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.

Posted Oct 15, 2019 - 06:23 PDT

Completed

The scheduled maintenance has been completed.

Posted Oct 11, 2019 - 19:00 PDT

Update

The maintenance delayed the pipeline significantly more than expected. Expect a processing delay of approximately 1 hour. We are very sorry for the inconvenience.

Posted Oct 11, 2019 - 18:47 PDT

Update

There was additional delay in the pipeline. It should recover shortly.

Posted Oct 11, 2019 - 17:50 PDT

Verifying

Maintenance is complete, now waiting for the pipeline to catch up.

Posted Oct 11, 2019 - 17:26 PDT

Update

We are about to enter the final period of maintenance, during which the pipeline can be delayed by up to 10 minutes. Sorry for the inconvenience. We will update when finished.

Posted Oct 11, 2019 - 17:05 PDT

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Oct 11, 2019 - 16:56 PDT

Scheduled

We will be performing DB maintenance. We don't expect any web app or pipeline downtime, but there may be processing delays of 5-10 minutes.

Posted Oct 09, 2019 - 15:51 PDT

This scheduled maintenance affected: Processing pipeline (Core Processing Pipeline).