Rollbar experienced a major outage caused by an interruption of service in one of our main data stores, starting at 1:40 AM PDT on May 6th and concluding at 2:33 AM PDT, when all components were fully restored.
During this window, there were two distinct impacts: for approximately 10 minutes we were unable to accept incoming data into our pipeline, and for approximately 14 minutes a small number of notifications were sent to incorrect recipients.
Starting at 1:40 AM PDT on May 6th, 2020, Rollbar experienced a sudden and unexpected failure of a core data store, which affected several components of our platform. Our team began bringing a hot replica online; the data store self-recovered before the failover to the replica was complete.
While our team was executing the failover to the hot replica, there was a short window of data inconsistency across the two systems (starting at 2:19 AM and ending at 2:33 AM), which caused incorrect notifications to be sent to the wrong recipients for a subset of our customers. The incorrect data was limited to the Item Title, Project Name, and Account Name; no other information was impacted. A separate security response has been initiated with the impacted customers to address this specific aspect of the incident.
Across the entire incident, the system was unable to accept incoming traffic for about 10 minutes; beyond that window, there was no data loss.
After that, the Processing Pipeline resumed processing events in real time, though events submitted earlier may have been delayed until 3:27 AM PDT.
A small number of accounts were impacted by notifications triggered by the data inconsistency between 2:19 AM and 2:33 AM, in which the Item Title, Project Name, and Account Name reached the wrong recipients. All affected customers have already been directly notified with the specific data that may have been accidentally exposed.
These are the impacted services:
As part of our standard runbook, Rollbar immediately began promoting one of the hot replicas prepared for this specific scenario of a database failure. The affected data store was able to self-recover without intervention before the operations to bring the replica online were concluded.
The data inconsistency between the replica and the original data store caused a few systems to briefly run in a split-brain state, in which some notifications were incorrectly triggered.
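For readers unfamiliar with the term, the following is a minimal, purely illustrative sketch (not Rollbar's actual implementation) of the standard fencing-token defense against split-brain: each promotion bumps a monotonically increasing generation number, and only the store holding the latest generation is allowed to drive side effects such as notifications.

```python
# Illustrative sketch only -- hypothetical names, not Rollbar's code.
# A "generation" acts as a fencing token: it is bumped on every
# promotion, so a recovered original and a promoted replica can never
# both act as the source of truth for notifications.

class Store:
    def __init__(self, name, generation):
        self.name = name
        self.generation = generation  # bumped each time a store is promoted

def notification_source(stores, current_generation):
    """Return the store allowed to trigger notifications.

    Only the store holding the latest known generation qualifies; a
    caller with a stale view of the cluster is refused outright.
    """
    leader = max(stores, key=lambda s: s.generation)
    if leader.generation != current_generation:
        raise RuntimeError("stale view of cluster; refusing to notify")
    return leader.name

original = Store("original", generation=1)
replica = Store("replica", generation=2)  # promoted during failover

# After the original self-recovers, both stores are alive, but only the
# highest-generation store may drive notifications.
print(notification_source([original, replica], current_generation=2))
```

Without such a guard, both stores can briefly look authoritative to different downstream consumers, which is exactly the condition that produced the incorrect notifications described above.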
We promptly resolved the inconsistency and proactively notified the affected customers about the incorrect notifications, completing all notifications within 48 hours of the incident.
To avoid future similar scenarios, we have taken defensive actions at two separate levels:
Each of these changes guarantees safe operation if similar conditions occur again. In keeping with our "fail safe" engineering principle, we have implemented more than one resolution redundantly.
We are investigating the cause of the original failure with the data store vendor; it appears to be unrelated to traffic volume or to the specific data set being handled at the moment of the incident. We'll publish updates on this topic soon.
1:04 AM PDT - Main data store begins experiencing performance degradation
1:22 AM PDT - Main data store crashes and becomes unavailable
1:40 AM PDT - Main data store is restarted and begins recovery
1:45 AM PDT - API returns to baseline performance, but incoming occurrences are processed with a delay of up to 2 hours
2:20 AM PDT - Website comes back online due to database failover
2:23 AM PDT - Main data store finishes recovery and becomes available again
3:27 AM PDT - Pipeline performance returns to normal