API Performance Degraded

Incident Report for Rollbar

Postmortem

Incident Summary

On Wednesday, July 24th, from 6:16 AM until 8:45 AM (all times PDT), our API was returning HTTP 502 for a large portion of incoming traffic. This meant we weren't capturing those errors, we weren't generating notifications from the Rollbar app, and the corresponding data is missing from the web UI. Our web UI performance was otherwise unaffected.

Analysis and Next Steps

At 6:10 AM we began to deploy a change to our API tier. Initial canary testing didn't reveal any issues, so we rolled out the change system-wide. As the change propagated, error rates began to increase significantly. Although our error rates were increasing, our health checks weren't failing at a rate sufficient to trigger an alert. The error we'd introduced was causing our backend processes to lock up as they received real API traffic. However, the process pool is large enough, and our load balancing is configured such, that the health check failures were only intermittent.
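The gap between a high request error rate and rare health-check alerts can be illustrated with a minimal sketch. It is not a description of the actual monitoring configuration: it assumes each health check is routed to a uniformly random worker, that a locked-up worker always fails the check, and that an alert fires only after several consecutive failures. The function name, the locked fraction, and the threshold are all hypothetical.

```python
def consecutive_failure_probability(locked_fraction: float, required_failures: int) -> float:
    """Chance that `required_failures` independent health checks in a row
    all land on a locked-up worker (assumed model, see lead-in)."""
    if not 0.0 <= locked_fraction <= 1.0:
        raise ValueError("locked_fraction must be between 0 and 1")
    if required_failures < 1:
        raise ValueError("required_failures must be at least 1")
    # Each check independently hits a locked worker with probability
    # locked_fraction, so a run of N failures has probability p ** N.
    return locked_fraction ** required_failures


if __name__ == "__main__":
    # Hypothetical numbers: 30% of workers locked, alert after 3 straight failures.
    p = consecutive_failure_probability(0.30, 3)
    print(f"chance a given run of 3 checks all fail: {p:.3f}")
```

Under these assumptions, roughly 30% of real requests would fail while any given run of three checks would all fail less than 3% of the time, which is one reason alerting on the observed request error rate can catch this kind of failure sooner than consecutive health-check failures.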

At 6:55 AM the error rate finally reached the threshold to trigger an alert. However, the alert was from a system monitoring a little-used API endpoint we provide to support some legacy clients that use deprecated SSL ciphers. The alert resolved itself immediately. The engineer who received the alert wasn't aware of the recent API deploy and disregarded the alert, as the endpoint is very rarely used, has a much lower implicit internal SLA than our primary api.rollbar.com endpoint, and is running at a legacy provider known to have intermittent networking issues. The issue was affecting api.rollbar.com as well, but a misconfiguration in our notification routing meant we weren't seeing those alerts.
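A routing gap of this kind can in principle be caught by a periodic consistency check. The following is a minimal sketch under stated assumptions, not the tooling used here: it assumes the set of monitored endpoints and the notification routes are available as plain data, and the function name, the legacy hostname, and the route names are hypothetical.

```python
def unrouted_endpoints(monitored: set[str], routes: dict[str, list[str]]) -> list[str]:
    """Return monitored endpoints whose alerts have no notification target.

    `routes` maps an endpoint name to the list of targets (for example
    on-call rotations) that receive its alerts. A missing key or an empty
    list both count as "no route".
    """
    return sorted(endpoint for endpoint in monitored if not routes.get(endpoint))


if __name__ == "__main__":
    # Hypothetical data: the legacy endpoint is routed, the primary one is not.
    monitored = {"api.rollbar.com", "legacy-api.example.com"}
    routes = {"legacy-api.example.com": ["on-call"], "api.rollbar.com": []}
    for endpoint in unrouted_endpoints(monitored, routes):
        print(f"WARNING: alerts for {endpoint} are not routed to anyone")
```

Running a check like this on every configuration change would flag an endpoint whose alerts are generated but never delivered.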

At 7:39 AM we noticed the elevated error rate on a performance graph and began our investigation. At 7:53 AM we reverted the API change we'd made, and the revert began to propagate. Error rates started to decrease, and the system returned to normal by 8:45 AM.

As you can imagine, as a company in the monitoring industry it's frustrating to have such a fundamental failure in how we monitor our own systems, especially after having a similar issue a month ago. As we said then, we're working on overhauling our monitoring and alerting systems along with related procedures around deployment and on-call response, but we obviously still have some work to do.

We know that you rely on Rollbar and we’re sorry to let you down. If you have any questions, please don't hesitate to contact us at support@rollbar.com.

Posted Jul 25, 2019 - 12:03 PDT

Resolved

This issue has been resolved and the API is once again processing data normally.

Posted Jul 24, 2019 - 08:55 PDT

Update

We believe we've identified the problem and response rates should be returning to normal over the next 10-15 minutes.

Posted Jul 24, 2019 - 08:03 PDT

Update

We are continuing to investigate this issue.

Posted Jul 24, 2019 - 08:02 PDT

Investigating

A large fraction of requests to api.rollbar.com are currently returning 502. We're investigating.

Posted Jul 24, 2019 - 07:50 PDT

This incident affected: API Tier (api.rollbar.com).