Beginning at 1:22 pm PT on 9/29/21, Rollbar began receiving a very large volume of traffic from a distributed set of customer endpoints, essentially causing a DDoS on our API tier. Between 1:28 pm and 6:22 pm, a portion of requests sent to api.rollbar.com failed with either a 500 or a 502 error. During this time, if these requests were not retried (retrying is the normal behavior for Rollbar SDKs), that data was lost.
The DDoS was not an attack. To mitigate situations like this, each Rollbar-maintained SDK adheres to rate-limit response headers so that the flow of data sent to our service is throttled gracefully. In this instance, a customer was using a custom SDK that lacked this functionality. Combined with the distributed nature of their product, this led to a DDoS situation.
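To illustrate the behavior the paragraph above describes, here is a minimal sketch of the backoff logic a well-behaved SDK applies when the server signals rate limiting. The header names and the 60-second fallback are illustrative assumptions, not Rollbar's documented API.

```python
def retry_delay_seconds(status_code, headers, default_delay=60):
    """Return seconds to wait before retrying, or None to send immediately.

    A 429 (Too Many Requests) response means the service is rate limiting
    the client; honoring its Retry-After header keeps a large, distributed
    fleet of clients from turning retries into a flood of traffic.
    Header names here are assumptions for illustration.
    """
    if status_code != 429:
        return None  # not rate limited; no delay needed
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0, int(retry_after))
        except ValueError:
            pass  # malformed header; fall back to a conservative default
    return default_delay
```

A client without this logic retries immediately on failure, which is exactly the pattern that amplifies load during an incident.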
Once our team identified the access token in question, we were able to use existing DDoS protection mechanisms to shield our internal systems from the overload. However, the volume of traffic was too much for our internal tooling, and we immediately began to see sporadic failures across our network due to network-related issues that are still under investigation, e.g., connection timeouts and increased packet loss.
After monitoring our systems and realizing our internal DDoS mechanisms were not effective, we enabled our cloud provider's DDoS protection service, Cloud Armor, to stop the traffic before it reached our internal network. This immediately resolved the network issues and allowed our systems to stabilize and return to normal.
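For readers unfamiliar with Cloud Armor, a rule of the kind described above can be expressed as an edge security policy that denies requests matching a condition. This is a hypothetical sketch under assumed names; the policy name, header name, and token placeholder are illustrative, not Rollbar's actual configuration.

```shell
# Create a security policy to attach to the load balancer fronting the API.
gcloud compute security-policies create api-ddos-policy \
    --description "Emergency DDoS mitigation for api.rollbar.com"

# Deny requests carrying the offending access token before they reach
# the internal network. Header and token values are placeholders.
gcloud compute security-policies rules create 1000 \
    --security-policy api-ddos-policy \
    --expression "request.headers['x-rollbar-access-token'] == 'TOKEN_TO_BLOCK'" \
    --action deny-403
```

Because the policy is evaluated at the cloud provider's edge, matching traffic is rejected before it consumes internal network capacity, which is why enabling it resolved the connection timeouts and packet loss.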
Our team is working with our customers to make sure that they are properly adhering to our rate-limit response headers. We are also planning an audit of our networking configuration to identify potential causes of the network degradation we saw during the incident. Finally, we will continue to employ our cloud provider's DDoS protection mechanism as a last resort for situations such as these. With the existing Cloud Armor rule in place, we can quickly apply it to future instances of this problem while we diagnose any internal networking issues.