Rollbar experienced a major outage caused by a significant increase in incoming traffic, with intermittent service availability starting at 10:30 PM PDT on April 21st and concluding with full service restoration by 8:03 PM PDT on April 22nd.
Starting at 10:30 PM PDT on April 21st, 2020, Rollbar observed a significant and sudden increase in incoming traffic. The traffic continued to rise rapidly until reaching a peak of roughly 3,000% of our normal baseline.
The spike in traffic was caused by a legitimate increase in events from one of our customers' platforms.
Rollbar immediately reacted by increasing capacity, which allowed us to serve approximately half of our incoming events.
A combination of custom-made SDKs used by that customer and a very specific, uncommon authentication pattern made it very difficult to effectively rate limit the source of the spike, so Rollbar had to build and deploy a custom filtering system to segregate the traffic.
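To illustrate the general idea behind that kind of traffic segregation, here is a minimal sketch of a per-access-token rate limiter with a per-token override; the class, names, and limits below are hypothetical and are not Rollbar's actual implementation.

```python
import time

# Hypothetical sketch: a token-bucket limiter keyed by project access token,
# with an override so a single spiking source can be clamped to its own limit
# without throttling every other customer.
class TokenBucketLimiter:
    def __init__(self, default_rate, default_burst):
        self.default_rate = default_rate    # tokens refilled per second
        self.default_burst = default_burst  # bucket capacity
        self.overrides = {}                 # access_token -> (rate, burst)
        self.buckets = {}                   # access_token -> (tokens, last_refill)

    def set_override(self, access_token, rate, burst):
        """Segregate one traffic source by giving it a dedicated, tighter limit."""
        self.overrides[access_token] = (rate, burst)

    def allow(self, access_token):
        rate, burst = self.overrides.get(access_token, (self.default_rate, self.default_burst))
        now = time.monotonic()
        tokens, last = self.buckets.get(access_token, (burst, now))
        tokens = min(burst, tokens + (now - last) * rate)
        if tokens >= 1:
            self.buckets[access_token] = (tokens - 1, now)
            return True
        self.buckets[access_token] = (tokens, now)
        return False
```

In this sketch, `limiter.set_override("spiking-token", rate=5, burst=10)` would clamp a single source while the default limits continued to apply to everyone else.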
Rollbar reacted with an immediate increase in capacity and a change to our rate limiting system in order to effectively serve all incoming traffic.
Rollbar is considering deprecating the authentication pattern that made our normal rate limiting system so fragile during this traffic spike, which would allow more effective control at the very edge of our perimeter and a better overall customer experience.
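As a purely illustrative example of why the authentication pattern matters at the edge: if a request can only be attributed to a project after its body is parsed, edge rate limiting becomes much more expensive than when the token arrives in a header. The header name and payload field below are assumptions used for illustration, not a statement of how the affected customer authenticated.

```python
import json

# Hypothetical sketch: attributing a request to a project at the edge.
# A header-based token lets the limiter decide before reading the body;
# a token embedded only in the JSON payload forces the edge to buffer and
# parse every request body, which is costly during a 3,000% traffic spike.
def identify_project(headers, read_body):
    token = headers.get("X-Rollbar-Access-Token")  # illustrative header name
    if token:
        return token, "header"   # cheap: no body read required
    body = read_body()           # expensive: full payload must be parsed
    return json.loads(body).get("access_token"), "body"
```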
The outage affected two major components of our platform, resulting in prolonged degraded service and sporadic full outages. There was a 2-hour period of full or near-full API outage, with sporadic partial outages across a 19-hour period starting at 1:15 AM.
During this time, customers likely received a large number of error responses from our API, and data sent to us may have been dropped without being processed or alerted on.
Web Application Impact
(All times in PDT)
Apr 21st, 22:33: We saw a 600% increase in traffic on our API nodes
Apr 22nd, 04:44: API servers were overwhelmed and we experienced a near-total outage of the API
Apr 22nd, 08:30: Traffic continued to rise, now at 1,500% of normal levels
Apr 22nd, 10:56: API was functioning close to normal, although with some degradation.
Apr 22nd, 12:00: Traffic peaked at 3,100% of normal levels
Apr 22nd, 12:31: API and some downstream dependencies were overwhelmed again and we experienced a full outage of the API
Apr 22nd, 13:22: We regained some API functionality and began processing traffic at around 50% of normal levels
Apr 22nd, 16:24: We successfully blocked a targeted portion of traffic, causing API usage to return to normal for nearly all customers
Apr 22nd, 17:30: We stabilized the infrastructure to handle the incoming traffic for all customers and removed the block, only experiencing very sporadic errors for a small subset of customers
Apr 22nd, 20:03: Incident fully resolved, functionality restored.
Some initial plans:
We've identified and scheduled a few short-term tasks that will greatly speed up our response time to massive load spikes
A medium/long-term plan to introduce a new stress/chaos testing approach to harden our API tier (see the sketch after this list)
An internal audit of how our SDKs/API are used in ways that might cause service degradation
Over the next week our Customer Success and Engineering teams will hash out a plan to make sure we provide more timely updates to all customers, as well as to individual customers that are affected by or causing service degradation
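As a sketch of what a stress/chaos testing approach for the API tier could look like (the staging URL, payload, and traffic multipliers below are placeholders, not our actual test plan), a load generator could step request volume up to roughly the 30x multiple seen during this incident and record how the API behaves at each level:

```python
import asyncio
import aiohttp

# Placeholder endpoint and payload; not Rollbar's real configuration.
STAGING_URL = "https://api-staging.example.com/api/1/item/"
PAYLOAD = {"access_token": "test-token", "data": {"body": {"message": "load test"}}}

async def one_request(session):
    """Return True if the request failed or the server returned a 5xx."""
    try:
        async with session.post(STAGING_URL, json=PAYLOAD) as resp:
            return resp.status >= 500
    except aiohttp.ClientError:
        return True

async def fire_batch(session, n):
    """Send n concurrent requests and count the failures."""
    results = await asyncio.gather(*(one_request(session) for _ in range(n)))
    return sum(results)

async def ramp(baseline=10, multipliers=(1, 5, 10, 20, 30), steps=60):
    """Step the batch size from baseline up to 30x baseline, mirroring the spike."""
    async with aiohttp.ClientSession() as session:
        for m in multipliers:
            batch = baseline * m
            for _ in range(steps):
                failures = await fire_batch(session, batch)
                print(f"{m}x baseline: batch of {batch}, {failures} failures")
                await asyncio.sleep(1)

if __name__ == "__main__":
    asyncio.run(ramp())
```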