High API response times / 502s
Incident Report for Rollbar
Postmortem

Beginning at 1:22 pm PT on 9/29/21, Rollbar started receiving a very large volume of traffic from a distributed set of customer endpoints, effectively creating a DDoS against our API tier. Between 1:28 pm and 6:22 pm, a portion of requests sent to api.rollbar.com failed with either a 500 or a 502 error. During this time, data from requests that were not retried (the normal behavior for Rollbar SDKs) was lost.

The DDoS was not an attack. To mitigate this kind of situation, every Rollbar-maintained SDK adheres to the rate limit response headers returned by our API, gracefully controlling the flow of data sent to our service. In this instance, a customer was using a custom SDK that lacked this functionality, and due to the distributed nature of their product, the resulting traffic amounted to a DDoS.
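For custom SDK authors, the sketch below shows one way to honor rate limiting when posting items to api.rollbar.com. It is a minimal illustration rather than our official client code, and the header names used for the access token and the back-off interval are assumptions; check the Rollbar API documentation for the exact headers returned alongside 429 responses.

    import time
    import requests

    API_URL = "https://api.rollbar.com/api/1/item/"

    def send_item(payload, access_token, max_attempts=3):
        # Post one item, backing off whenever the API reports rate limiting (HTTP 429).
        headers = {"X-Rollbar-Access-Token": access_token}
        for _ in range(max_attempts):
            resp = requests.post(API_URL, json=payload, headers=headers, timeout=10)
            if resp.status_code != 429:
                return resp  # accepted, or a non-rate-limit error for the caller to handle
            # Hypothetical header name: wait until the rate limit window resets
            # instead of immediately resending and adding to the load.
            wait_seconds = int(resp.headers.get("X-Rate-Limit-Remaining-Seconds", 60))
            time.sleep(wait_seconds)
        return resp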

Once our team identified the access token in question, we were able to use our existing DDoS protection mechanisms to shield our internal systems from the overload. However, the volume of traffic was too much for our internal tooling, and we immediately began to see sporadic failures across our network from network-related issues that have yet to be determined, such as connection timeouts and increased packet loss.

After monitoring our systems and realizing our internal DDoS mechanisms were not effective, we enabled our cloud provider's DDoS protection service, Cloud Armor, to stop the traffic from reaching our internal network. This immediately resolved the network issues and allowed our systems to stabilize and return to normal.

Our team is working with our customers to make sure they are properly adhering to our rate limit response headers. We are also planning an audit of our networking configuration to identify potential causes of the network degradation we saw during the incident. Finally, we will continue to employ our cloud provider's DDoS protection as a last resort for situations such as these; with the Cloud Armor rule now in place, we can apply it very quickly to future occurrences of this problem while we diagnose any internal networking issues.
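For reference, a Cloud Armor rule of this kind can be expressed as a single gcloud command. The example below is only a sketch with hypothetical values (the policy name, priority, and the header used to match the offending access token); it is not the exact rule we have deployed.

    gcloud compute security-policies rules create 1000 \
        --security-policy=api-edge-policy \
        --expression="request.headers['x-rollbar-access-token'] == 'OFFENDING_TOKEN'" \
        --action=deny-403 \
        --description="Temporarily block a misbehaving access token at the edge"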

Posted Oct 04, 2021 - 21:50 PDT

Resolved
This incident has been resolved.
Posted Sep 29, 2021 - 19:34 PDT
Monitoring
We've been seeing good health from the API tier since approximately 6:45 pm PT. We'll continue to monitor.
Posted Sep 29, 2021 - 19:09 PDT
Update
We're continuing to investigate the API side of this incident; we're seeing high API response times and intermittent 502 responses from the API tier.
Posted Sep 29, 2021 - 18:06 PDT
Update
We have resolved the issue with the items list page not automatically reloading. We are continuing to work on restoring the availability of the API layer.
Posted Sep 29, 2021 - 16:38 PDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 29, 2021 - 14:25 PDT
Identified
We are receiving an anomalously large volume of traffic that is affecting our API layer. While we are accepting many of your errors, we are not able to process all incoming traffic. We are working to scale up our API layer in response.
Posted Sep 29, 2021 - 14:25 PDT
Investigating
Currently, the items list page is not automatically reloading. In order to see new items, you will need to manually refresh the page. We apologize for the inconvenience.
Posted Sep 29, 2021 - 13:32 PDT
This incident affected: Web App (rollbar.com) and API Tier (api.rollbar.com).