Processing Pipeline Delay

Resolved·Full outage

We’ve published a write-up of this incidentRead the write-up

Read it here

Affected components

Jul 14, 2020, 05:19 PM

Jul 15, 2020, 04:28 AM

Updates

Write-up published

Read it here

Resolved

Overview

Between July 14th and 15th Rollbar experienced a major outage caused by three different and independent incidents, which resulted in an intermittent connectivity issue \(July 14th, from 7:16am PDT until 8:47am\) and a substantial increase in latency in our processing pipeline.

What Happened

These are the three main issues which resulted in the connectivity loss and spikes in processing latency:

An issue in updating our SSL/TSL certificates, which was generated incorrectly without the necessary intermediary CA certificate, resulting in the intermittent inability to send data to the Rollbar cloud.
An unexpected outage in our Cloud Provider forced our Redis cluster to be restored with a reset of most of the Rate Limit data, allowing a much higher amount of traffic to be processed.
A substantial increase in incoming traffic \(up to 400% compared to the baseline\) while the previously described incidents were still active, resulting in a massive backlog of items in need of processing, quickly saturating our available resources.

Impact

July 14th, from 7:16am PDT until 8:47am: Rollbar SDKs are unable to send data to our cloud infrastructure.
July 14th, from 12:33pm PDT until 12:45pm: Our processing pipeline is paused while fixing a bug in our deploy automation. We didn’t experience any loss of data, but no data is processed real time.
July 14th, from 8:47am PDT until 9:28pm : A subset of data sent to Rollbar present delay in processing. New data is processed in Real Time as usual, but historical data \(like errors cached between 7:16am PDT and 8:47am\) present a significant delay.
July 15th, from 11:43am PDT until 3:00pm : Some Notifications, iOS items, Versions and some item statistics are not getting processed \(based on longevity in our worker queues\).
July 15th, from 3:00pm PDT until 6:01pm: Data that has been submitted and not processed for longer than 30 mins is kept in our system but not actively processed.

Resolution

Due to the nature of the incident, which can be described as three independent issues happening at the same time, we implemented the following remediations \(numbering matches the issue described in the “What Happened” section\):

The SSL/TSL certificate was fixed with the inclusion of the intermediary CA certificate. Due to the nature of caching used by some of our customers, the problem was noticeable for a longer time window than the one actually experienced. The certificate generation is implemented as a manual operation \(while the deployment is fully automated like the rest of our infrastructure\). Our internal run books have been updated with the correct procedure to avoid similar issues in the future.
When our Cloud Provider unexpectedly reset our Redis cluster, we immediately reacted with a reconfiguration of it based on the latest available snapshot. We have no reasons to believe this issue would happen again in the future, but in line with our operational policy we decided to invest in a refactoring of our Redis cluster that will allow persistence in case of a total loss of the instance like we experienced on July 15th.
Due to the increase in traffic we scaled up all the workers across the board to minimize the impact in latency, prioritizing the processing of real time data and notifications over historical data to make sure our customer’s automation won’t be heavily impacted. On July 15th at 6:01pm PDT we decided to skip the processing of all the data that has been sitting in our queues for more than 30 minutes waiting to be processed. The vast majority of these occurrences would have been rejected by our API layer in normal circumstances.

While the incident has been open for many hours, most of the data collected during this time window has been processed real time with no significant delay for most of our customers. The data we decided to skip in order to regain real time capabilities is not lost and still present in our data stores \(aggregations might be missing\).

Timeline

July 14th, 7:16am PDT: Rollbar deployed a new SSL certificate which did not contain intermediary cert, causing a subset of the API calls from our SDKs to fail.
July 14th, 8:47am PDT: Correct certificates are rolled out and normal operations are restored.
July 14th: 8:54am PDT: Incoming traffic raised up 200% of our baseline. Part of the spike is related to the ability of our SDKs to cache errors and delay the upload on our cloud infrastructure during API outages.
July 14th, 12:00pm PDT: Different changes in our platform are fully deployed, allowing Rollbar to handle the increase of incoming traffic.
July 14th, 12:33pm PDT: A bug in our new workers provisioning systems, triggered while scaling our platform up, accidentally stalled the processing pipeline.
July 14th, 12:45pm PDT: The bug in our new workers provisioning systems is fixed and we’re able to continue adding workers to our pipeline.
July 14th, 9:28pm PDT: All data is processed and queues are back to minimal latency.
July 15th, 11:43am PDT: We discovered that, related to the increase in our incoming traffic, some shards in our infrastructure are performing extremely poorly, causing our latency and processing delay to rise significantly.
July 15th, 3:00pm PDT: Our main Cloud Provider suffers a major outage that impacts our Redis instance, causing the loss of temporary data used to power our rate limiting system. With the reset of our rate limits, the incoming traffic keeps increasing, reaching 400% compared to the standard baseline. Processing delay and latency are rising again.
July 15th, 5:08pm PDT: Redis data is restored and traffic levels are back to normal.
July 15th, 6:01pm PDT: To revert some of the negative effects of the rate limit reset, we decided to skip the processing of all the data that have been sitting in our queues for more than 30 minutes waiting to be processed. The vast majority of these occurrences would have been rejected by our API layer in normal circumstances.
July 15th, 6:48pm PDT: Everything is operational and average processing latency is back to <10sec.

Tue, Jul 21, 2020, 04:25 AM

Resolved

This incident has been resolved.

Wed, Jul 15, 2020, 04:28 AM(5 days earlier)

Monitoring

Pipeline now on target to normalize in 1.5 hrs. Will continue to update as we get closer.

Wed, Jul 15, 2020, 02:22 AM(2 hours earlier)

Monitoring

The pipeline should be fully normalized in <3 hours. Will continue to update as we optimize our capacity. Thanks for your patience!

Tue, Jul 14, 2020, 10:45 PM(3 hours earlier)

Monitoring

A fix has been implemented and we are monitoring the results.

Tue, Jul 14, 2020, 08:43 PM(2 hours earlier)

Identified

We have successfully expanded our capacity to handle the new load and are finally making headway through the backlog. ETA for normalization is 8pm PST. Thanks for your patience!

Tue, Jul 14, 2020, 08:43 PM

Identified

We are working to deal with a large amount of additional load on the pipeline, will continue to update as we increase capacity.

Tue, Jul 14, 2020, 07:49 PM(54 minutes earlier)

Monitoring

We are actively monitoring the backlog.

Tue, Jul 14, 2020, 05:26 PM(2 hours earlier)

Investigating

Due to the recent API issues, we are currently processing a very large number of errors from clients. Our pipeline is processing the backlog now, will update once we have an estimate of when it will complete. Thanks for your patience!

Tue, Jul 14, 2020, 05:19 PM