On Wednesday, July 24th, from 6:16 AM until 8:45 AM (all times PDT), our API returned HTTP 502 for a large portion of incoming traffic. This meant we weren't capturing the errors sent in those requests, we weren't generating notifications for them from the Rollbar app, and that data is missing from the web UI. Our web UI performance was otherwise unaffected.
Analysis and Next Steps
At 6:10 AM we began to deploy a change to our API tier. Initial canary testing didn't reveal any issues, so we rolled out the change system-wide. As the change propagated, error rates began to increase significantly. Although our error rates were increasing, our health checks weren't failing at a rate sufficient to trigger an alert. The error we'd introduced was causing our backend processes to lock up as they received real API traffic; however, the process pool is large enough, and our load balancing is configured in such a way, that the health check failures were only intermittent.
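To illustrate why intermittent health check failures can hide a serious error rate, here is a minimal sketch. It is not our actual monitoring configuration: the pool size, the number of locked processes, the "N consecutive failures" rule, and the 5% threshold are all hypothetical values chosen for illustration, and the function names are made up. It also assumes each probe lands on a uniformly random process, independently of the others.

```python
def probe_failure_probability(locked: int, pool_size: int) -> float:
    """Chance that a single probe is routed to a locked-up process.

    Assumes (hypothetically) that the load balancer picks a process
    uniformly at random for every request.
    """
    if pool_size <= 0 or not 0 <= locked <= pool_size:
        raise ValueError("need 0 <= locked <= pool_size and pool_size > 0")
    return locked / pool_size


def consecutive_failure_probability(p_fail: float, required: int) -> float:
    """Chance that `required` independent probes in a row all fail."""
    if required < 1:
        raise ValueError("required must be at least 1")
    return p_fail ** required


def error_rate_alert(error_rate: float, threshold: float = 0.05) -> bool:
    """Hypothetical alert on the observed error rate of real traffic."""
    return error_rate >= threshold


# Hypothetical numbers: 30 of 100 backend processes are locked up.
p = probe_failure_probability(locked=30, pool_size=100)

# A rule that needs 3 failed probes in a row trips on under 3% of attempts...
p_three_in_a_row = consecutive_failure_probability(p, required=3)

# ...while roughly 30% of real requests are failing, far above a 5% threshold.
fires_on_traffic = error_rate_alert(p)
```

Under these assumptions, a check that samples a large pool can stay quiet while a sizable share of real requests fail, whereas an alert keyed to the error rate of real traffic would fire much sooner.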
At 6:55 AM the error rate finally reached the threshold to trigger an alert. However, the alert came from a system monitoring a little-used API endpoint we provide to support some legacy clients that use deprecated SSL ciphers. The alert resolved itself immediately. The engineer who received the alert wasn't aware of the recent API deploy and disregarded it because the endpoint is very rarely used, has a much lower implicit internal SLA than our primary api.rollbar.com endpoint, and runs at a legacy provider known to have intermittent networking issues. The issue was affecting api.rollbar.com as well; however, a misconfiguration in our notification routing meant we weren't seeing those alerts.
At 7:39 AM we noticed the elevated error rate on a performance graph and began our investigation. At 7:53 AM we reverted the API change we'd made, and the revert began to propagate. Error rates started to decrease, and the system returned to normal by 8:45 AM.
As you can imagine, as a company in the monitoring industry it's frustrating to have such a fundamental failure in how we monitor our own systems, especially after having a similar issue a month ago. As we said then, we're working on overhauling our monitoring and alerting systems along with related procedures around deployment and on-call response, but we obviously still have some work to do.
We know that you rely on Rollbar and we’re sorry to let you down. If you have any questions, please don't hesitate to contact us at firstname.lastname@example.org.