Between July 14th and 15th Rollbar experienced a major outage caused by three different and independent incidents, which resulted in an intermittent connectivity issue (July 14th, from 7:16am PDT until 8:47am) and a substantial increase in latency in our processing pipeline.
These are the three main issues which resulted in the connectivity loss and spikes in processing latency:
- An issue in updating our SSL/TSL certificates, which was generated incorrectly without the necessary intermediary CA certificate, resulting in the intermittent inability to send data to the Rollbar cloud.
- An unexpected outage in our Cloud Provider forced our Redis cluster to be restored with a reset of most of the Rate Limit data, allowing a much higher amount of traffic to be processed.
- A substantial increase in incoming traffic (up to 400% compared to the baseline) while the previously described incidents were still active, resulting in a massive backlog of items in need of processing, quickly saturating our available resources.
- July 14th, from 7:16am PDT until 8:47am: Rollbar SDKs are unable to send data to our cloud infrastructure.
- July 14th, from 12:33pm PDT until 12:45pm: Our processing pipeline is paused while fixing a bug in our deploy automation. We didn’t experience any loss of data, but no data is processed real time.
- July 14th, from 8:47am PDT until 9:28pm : A subset of data sent to Rollbar present delay in processing. New data is processed in Real Time as usual, but historical data (like errors cached between 7:16am PDT and 8:47am) present a significant delay.
- July 15th, from 11:43am PDT until 3:00pm : Some Notifications, iOS items, Versions and some item statistics are not getting processed (based on longevity in our worker queues).
- July 15th, from 3:00pm PDT until 6:01pm: Data that has been submitted and not processed for longer than 30 mins is kept in our system but not actively processed.
Due to the nature of the incident, which can be described as three independent issues happening at the same time, we implemented the following remediations (numbering matches the issue described in the “What Happened” section):
- The SSL/TSL certificate was fixed with the inclusion of the intermediary CA certificate. Due to the nature of caching used by some of our customers, the problem was noticeable for a longer time window than the one actually experienced. The certificate generation is implemented as a manual operation (while the deployment is fully automated like the rest of our infrastructure). Our internal run books have been updated with the correct procedure to avoid similar issues in the future.
- When our Cloud Provider unexpectedly reset our Redis cluster, we immediately reacted with a reconfiguration of it based on the latest available snapshot. We have no reasons to believe this issue would happen again in the future, but in line with our operational policy we decided to invest in a refactoring of our Redis cluster that will allow persistence in case of a total loss of the instance like we experienced on July 15th.
- Due to the increase in traffic we scaled up all the workers across the board to minimize the impact in latency, prioritizing the processing of real time data and notifications over historical data to make sure our customer’s automation won’t be heavily impacted. On July 15th at 6:01pm PDT we decided to skip the processing of all the data that has been sitting in our queues for more than 30 minutes waiting to be processed. The vast majority of these occurrences would have been rejected by our API layer in normal circumstances.
While the incident has been open for many hours, most of the data collected during this time window has been processed real time with no significant delay for most of our customers. The data we decided to skip in order to regain real time capabilities is not lost and still present in our data stores (aggregations might be missing).
- July 14th, 7:16am PDT: Rollbar deployed a new SSL certificate which did not contain intermediary cert, causing a subset of the API calls from our SDKs to fail.
- July 14th, 8:47am PDT: Correct certificates are rolled out and normal operations are restored.
- July 14th: 8:54am PDT: Incoming traffic raised up 200% of our baseline. Part of the spike is related to the ability of our SDKs to cache errors and delay the upload on our cloud infrastructure during API outages.
- July 14th, 12:00pm PDT: Different changes in our platform are fully deployed, allowing Rollbar to handle the increase of incoming traffic.
- July 14th, 12:33pm PDT: A bug in our new workers provisioning systems, triggered while scaling our platform up, accidentally stalled the processing pipeline.
- July 14th, 12:45pm PDT: The bug in our new workers provisioning systems is fixed and we’re able to continue adding workers to our pipeline.
- July 14th, 9:28pm PDT: All data is processed and queues are back to minimal latency.
- July 15th, 11:43am PDT: We discovered that, related to the increase in our incoming traffic, some shards in our infrastructure are performing extremely poorly, causing our latency and processing delay to rise significantly.
- July 15th, 3:00pm PDT: Our main Cloud Provider suffers a major outage that impacts our Redis instance, causing the loss of temporary data used to power our rate limiting system. With the reset of our rate limits, the incoming traffic keeps increasing, reaching 400% compared to the standard baseline. Processing delay and latency are rising again.
- July 15th, 5:08pm PDT: Redis data is restored and traffic levels are back to normal.
- July 15th, 6:01pm PDT: To revert some of the negative effects of the rate limit reset, we decided to skip the processing of all the data that have been sitting in our queues for more than 30 minutes waiting to be processed. The vast majority of these occurrences would have been rejected by our API layer in normal circumstances.
- July 15th, 6:48pm PDT: Everything is operational and average processing latency is back to <10sec.