From 7:50 AM PT on Jan 25, 2022 until 8:34 AM PT Jan 29, 2022 we experienced a significant performance degradation in our pipeline. As a result, error occurrences sent to us over that span were processed but with significant delays. The processing delays limit our ability to notify our customers of new issues which slows their reaction time. We understand it’s paramount that we deliver on our promise of real-time issue alerts.
The root cause was a combination of degraded performance in our Kafka cluster, database deadlocks, and an increase in traffic.
Rollbar engineers were first alerted at 8:39 AM PT and began investigating the issue immediately. We pursued fixes from both the application and infrastructure sides. One primary issue was a crash loop in one service that caused constant partition rebalancing in Kafka. From the application side, we applied immediate fixes that stopped the rebalancing to stop and allowed the application to make progress. After the incident, we applied infrastructure-side fixes to address the root cause.
To prevent this incident from recurring, we have reviewed and updated our Kafka cluster monitoring, and will continue to review Kafka client code to ensure it doesn't put unnecessary stress on our infrastructure. We are also reviewing our Kafka monitoring to ensure we identify its issues proactively. Additionally, we are addressing the specific applications to address deadlock issues.
As always, thank you for being a Rollbar customer.