Beginning at 1:22 pm PT on 9/29/21, Rollbar began receiving a very large volume of traffic from a distributed set of customer endpoints, essentially causing a DDoS on our API tier. Between 1:28 pm and 6:22 pm, a portion of requests sent to api.rollbar.com failed with either a 500 or a 502 error. During this time, if these requests were not retried (retrying is the normal behavior for Rollbar SDKs), that data was lost.
The DDoS was not an attack. To mitigate situations like this, each Rollbar-maintained SDK adheres to rate-limit response headers so that the flow of data sent to our service is throttled gracefully. In this instance, a customer was using a custom SDK that lacked this functionality. Combined with the distributed nature of their product, this led to a DDoS situation.
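To illustrate the behavior the paragraph above describes, here is a minimal sketch of the backoff logic a well-behaved SDK applies when the server signals rate limiting. The header names and the 60-second fallback are illustrative assumptions, not Rollbar's documented API.

```python
def retry_delay_seconds(status_code, headers, default_delay=60):
    """Return seconds to wait before retrying, or None to send immediately.

    A 429 (Too Many Requests) response means the service is rate limiting
    the client; honoring its Retry-After header keeps a large, distributed
    fleet of clients from turning retries into a flood of traffic.
    Header names here are assumptions for illustration.
    """
    if status_code != 429:
        return None  # not rate limited; no delay needed
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0, int(retry_after))
        except ValueError:
            pass  # malformed header; fall back to a conservative default
    return default_delay
```

A client without this logic retries immediately on failure, which is exactly the pattern that amplifies load during an incident.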
Once our team identified the access token in question, we were able to use existing DDoS protection mechanisms to shield our internal systems from the overload. However, the volume of traffic was too much for our internal tooling, and we immediately began to see sporadic failures across our network due to network-related issues that are still under investigation, e.g., connection timeouts and increased packet loss.
After monitoring our systems and realizing our internal DDoS mechanisms were not effective, we enabled our cloud provider's DDoS protection service, Cloud Armor, to stop the traffic before it reached our internal network. This immediately resolved the network issues and allowed our systems to stabilize and return to normal.
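For readers unfamiliar with Cloud Armor, a rule of the kind described above can be expressed as an edge security policy that denies requests matching a condition. This is a hypothetical sketch under assumed names; the policy name, header name, and token placeholder are illustrative, not Rollbar's actual configuration.

```shell
# Create a security policy to attach to the load balancer fronting the API.
gcloud compute security-policies create api-ddos-policy \
    --description "Emergency DDoS mitigation for api.rollbar.com"

# Deny requests carrying the offending access token before they reach
# the internal network. Header and token values are placeholders.
gcloud compute security-policies rules create 1000 \
    --security-policy api-ddos-policy \
    --expression "request.headers['x-rollbar-access-token'] == 'TOKEN_TO_BLOCK'" \
    --action deny-403
```

Because the policy is evaluated at the cloud provider's edge, matching traffic is rejected before it consumes internal network capacity, which is why enabling it resolved the connection timeouts and packet loss.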
Our team is working with our customers to make sure that they are properly adhering to our rate-limit response headers. We are also planning an audit of our networking configuration to identify potential causes of the network degradation we saw during the incident. Finally, we will continue to employ our cloud provider's DDoS protection mechanism as a last resort for situations such as these. With the existing Cloud Armor rule in place, we can quickly apply it to future instances of this problem while we diagnose any internal networking issues.