API outage
Incident Report for Rollbar
Postmortem

Overview

Rollbar experienced a major outage caused by a significant increase in incoming traffic, with intermittent service availability starting at 10:30pm PDT on April 21st and concluding with full service restoration by 8:03pm PDT on April 22nd.

What Happened

Starting at 10:30pm PDT on April 21st, 2020, Rollbar observed a sudden and significant increase in incoming traffic. Traffic continued to rise rapidly until it peaked at 3,000% of the normal baseline.

The spike in traffic was caused by a legitimate increase of events on the platform of one of our customers.

Rollbar immediately reacted by increasing capacity, which allowed us to serve approximately half of our incoming events.

A combination of custom-made SDKs from this customer and a very specific, uncommon authentication pattern made our attempts to rate limit the source of the spike ineffective, so Rollbar had to build and deploy a custom filtering system for traffic segregation.
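
To make the approach concrete, here is a minimal illustrative sketch of such a traffic-segregation filter (not Rollbar's actual implementation; the header and field names, ISOLATED_TOKENS, and the queue objects are assumptions): look up the project access token wherever an SDK might have put it, then divert events from the isolated token(s) to a separate queue so the shared pipeline keeps serving everyone else.

    import json
    import queue

    # Hypothetical set of access tokens whose traffic should be segregated.
    ISOLATED_TOKENS = {"aaaabbbbccccdddd"}

    shared_queue = queue.Queue()    # normal processing pipeline
    isolated_queue = queue.Queue()  # separate pipeline for the spiking customer


    def extract_access_token(headers, query, body):
        """Look for the project access token in the places an SDK might put it
        (header, query string, or JSON payload)."""
        token = headers.get("X-Rollbar-Access-Token") or query.get("access_token")
        if token:
            return token
        try:
            return json.loads(body).get("access_token")
        except (ValueError, AttributeError):
            return None


    def route_event(headers, query, body):
        """Divert events from isolated tokens to their own queue; everything
        else continues through the shared pipeline."""
        token = extract_access_token(headers, query, body)
        if token in ISOLATED_TOKENS:
            isolated_queue.put(body)
        else:
            shared_queue.put(body)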

Resolution

Rollbar reacted with an immediate increase in capacity and a change in our rate limiting system to effectively serve all our incoming traffic.

Rollbar is considering deprecating the authentication pattern that made our normal rate limiting system so fragile during this traffic spike. Doing so would allow for more effective control at the very edge of our perimeter and a better overall customer experience.

Impact

The outage affected two major components of our platform, resulting in prolonged degraded service and sporadic full outages. There was a 2-hour period of full or near-full API outage, with sporadic partial outages across a 19-hour period starting at 1:15 AM on April 22nd.

During this time, customers likely received a large number of error responses from our API, and data sent to us was potentially dropped rather than processed or alerted on.

(All times in PDT)

  • Web Application Impact:

    • [Apr 22nd, 01:15 - 07:50] Degraded web performance and availability, sporadic error pages
  • API Impact:

    • [Apr 22nd, 01:15 - 04:44] Degraded API performance and availability, serving roughly 75% of incoming traffic
    • [Apr 22nd, 04:44 - 06:30] API degrades progressively until the Rollbar API is in near-full outage
    • [Apr 22nd, 06:30 - 12:30] API begins to improve, eventually serving around 90% of incoming traffic
    • [Apr 22nd, 12:30 - 13:20] API experiences a full outage
    • [Apr 22nd, 13:20 - 16:24] API begins to recover, serving progressively more traffic
    • [Apr 22nd, 16:24 - 18:08] API is fully operational for all customers except one
    • [Apr 22nd, 18:08 - 20:03] API is fully operational for all customers. The one customer with the traffic spike regains access to our platform through a freshly deployed traffic segregation system
    • [Apr 22nd, 20:03] API is fully operational for everyone

Timeline

(All times in PDT)

  • Apr 21st, 22:33: We saw a 600% increase in traffic on our API nodes

    • This caused degradation of the API as it struggled to rate limit the traffic increase
  • Apr 22nd, 04:44: API servers were overwhelmed and we experienced a near-total outage of the API

    • We began to recover functionality progressively over the next few hours as we took steps to remediate
  • Apr 22nd, 08:30: Traffic continued to rise, now at 1,500% of normal levels

  • Apr 22nd, 10:56: API was functioning close to normal, although with some degradation.

  • Apr 22nd, 12:00: Traffic peaked at 3,100% of normal levels

  • Apr 22nd, 12:31: API and some downstream dependencies were overwhelmed again and we experienced a full outage of the API

  • Apr 22nd, 13:22: We regained some API functionality and began processing traffic at around 50% of normal levels

  • Apr 22nd, 16:24: We successfully blocked a targeted portion of traffic, causing API usage to return to normal for nearly all customers

  • Apr 22nd, 17:30: We stabilized the infrastructure to handle the incoming traffic for all customers and removed the block, only experiencing very sporadic errors for a small subset of customers

  • Apr 22nd, 20:03: Incident fully resolved, functionality restored.

Action Items

Technical:

  • Fix an internal issue that was causing our Web tier to error when there was an API tier outage
  • We've identified and scheduled a few short-term tasks that will greatly speed up our response time to massive load spikes

    • Namely, a more robust way of identifying individual customers at the load balancer level and a way to quickly partition production traffic for a particular customer (a rough sketch of this idea follows this list)
  • A medium/long-term plan to introduce a new stress/chaos testing approach to hardening our API tier

  • An internal audit of SDK/API usage patterns that might cause service degradation

    • Depending on the number of customers, we may begin to slowly deprecate parts of our API and replace them with more robust implementations
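
As a rough sketch of the per-customer control mentioned in the first item above (illustrative only; the rate, burst, and function names are made-up assumptions), a token bucket keyed by customer could run at the load balancer layer once each request is attributable to a customer:

    import time
    from collections import defaultdict

    # Illustrative per-customer token bucket; the limits below are made-up numbers.
    RATE = 100.0   # sustained requests per second allowed per customer
    BURST = 500.0  # short-term burst allowance per customer

    # customer_id -> [available tokens, time of last refill]
    _buckets = defaultdict(lambda: [BURST, time.monotonic()])


    def allow(customer_id):
        """Refill the customer's bucket based on elapsed time, then spend one
        token. Returns False when the customer is over its limit (e.g. respond
        with HTTP 429)."""
        tokens, last = _buckets[customer_id]
        now = time.monotonic()
        tokens = min(BURST, tokens + (now - last) * RATE)
        allowed = tokens >= 1.0
        _buckets[customer_id] = [tokens - 1.0 if allowed else tokens, now]
        return allowed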

Customer-focused:

  • Over the next week, our Customer Success and Engineering teams will hash out a plan to make sure we provide more timely updates to all customers, as well as to individual customers that are affected by or causing service degradation

    • Some initial plans:

      • Much more frequent StatusPage updates during an active incident
      • Improve our current dedicated point of contact process for Enterprise customers
Posted Apr 23, 2020 - 17:36 PDT

Resolved
We've implemented a fix and will continue monitoring our systems. We have a post-mortem scheduled for tomorrow AM; afterwards, we will publish the post-mortem and RFO to our status page.

The system is still experiencing a slightly elevated error rate for a handful of customers, but for the majority, the system is stable. In particular, we are paying attention to 502s being returned by our API tier. We are currently returning a handful per minute across the entire fleet, and only for a small subset of customers.

We deeply apologize for the outage. We know that many customers rely on our service to run their businesses. More info to follow tomorrow morning.
Posted Apr 22, 2020 - 20:14 PDT
Update
We've restored full service quality for the vast majority of customers (all but one). We've also fixed issues on the Web tier, for all customers. We're continuing to work to restore full service quality to the remaining customer.
Posted Apr 22, 2020 - 17:39 PDT
Update
We're continuing to work on a resolution for this incident.
Posted Apr 22, 2020 - 13:32 PDT
Update
We have identified the primary cause for the outage and are working on multiple mitigation strategies. We will write a full post mortem for this incident once it's resolved, but here are some details:

At around 10:30pm PDT on 4/21/20, we began seeing a 5.5x increase in traffic, all from a single customer. This caused our rate limiting system to lock up backend API servers, which caused our load balancers to queue requests and retry multiple times across the fleet. The congestion increased slowly until US Eastern traffic began to ramp up this morning. This cascaded into frequent outages on each API node.
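
For illustration only (the baseline rate and retry count below are assumed, not measured), this is how load balancer retries can multiply the effective load on the backend once requests start timing out:

    # Illustrative arithmetic only: assumed numbers, not Rollbar's measured values.
    baseline_rps = 1000       # hypothetical normal request rate
    spike_multiplier = 5.5    # the ~5.5x traffic increase described above
    lb_retries = 3            # hypothetical number of retries per timed-out request

    incoming = baseline_rps * spike_multiplier
    # When stalled backends cause requests to time out and be retried, each
    # request can hit the API tier (1 + retries) times, compounding the spike.
    effective_backend_load = incoming * (1 + lb_retries)

    print(f"incoming: {incoming:.0f} rps, effective load: {effective_backend_load:.0f} rps")
    # -> incoming: 5500 rps, effective load: 22000 rps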

We are now working on partitioning the incoming traffic so that data from most of our customers can be processed and functionality restored to normal, or at least to a partially degraded state.
Posted Apr 22, 2020 - 11:18 PDT
Update
We're continuing to work toward a resolution for this issue.
Posted Apr 22, 2020 - 09:31 PDT
Update
We are continuing to work on a fix for this issue.
Posted Apr 22, 2020 - 06:34 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 22, 2020 - 06:10 PDT
This incident affected: API Tier (api.rollbar.com).