Incident Summary
On Friday, October 11th, from approximately 4:30 PM to 8:30 PM (all times PST), our data processing pipeline lagged by up to 3 hours. As a result, data sent to the Rollbar API was not immediately accessible in rollbar.com.
Analysis and Next Steps
At approximately 4:30 PM, we began routine maintenance of one of our MySQL clusters. During this maintenance, we failed to apply an updated configuration to some of the servers that handle data processing. Around 5:20 PM, we realized our mistake and began correcting the configuration on the affected servers. However, updating certain portions of our pipeline created downstream issues that left our system overloaded and slow. This slowness caused our usual deployment process to fail, so we had to deploy to each server one by one. By 6:30 PM, we had corrected the configuration on enough servers that we could update the remaining ones with our usual deployment process, and the system began to normalize. By 8:30 PM, our pipeline had completely caught up and data was no longer lagging.
Fortunately, this issue was localized to our processing pipeline, and our ingestion system was not affected. We are confident that we did not lose any incoming data during the incident and that we successfully processed the full backlog of data sent via the Rollbar API.
What we're doing next
In the near term, we plan to automate more of this maintenance so we can avoid the possibility of human error. Longer term, we're working toward a new architecture for this datastore that eliminates the need for this kind of maintenance.
We know that you rely on Rollbar, and we're sorry that we let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.