Increased occurrence processing pipeline latency

Incident Report for Rollbar

Postmortem

From 7:50 AM PT on Jan 25, 2022 until 8:34 AM PT on Jan 29, 2022, we experienced a significant performance degradation in our pipeline. As a result, error occurrences sent to us over that span were processed, but with significant delays. Processing delays limit our ability to notify our customers of new issues, which slows their reaction time. We understand it’s paramount that we deliver on our promise of real-time issue alerts.

The root cause was a combination of degraded performance in our Kafka cluster, database deadlocks, and an increase in traffic.

Rollbar engineers were first alerted at 8:39 AM PT and began investigating the issue immediately. We pursued fixes from both the application and infrastructure sides. One primary issue was a crash loop in one service that caused constant partition rebalancing in Kafka. From the application side, we applied immediate fixes that stopped the rebalancing and allowed the application to make progress. After the incident, we applied infrastructure-side fixes to address the root cause.

To prevent this incident from recurring, we have reviewed and updated our Kafka cluster monitoring, and will continue to review Kafka client code to ensure it doesn't put unnecessary stress on our infrastructure. We are also continuing to review our Kafka monitoring to ensure we identify Kafka issues proactively. Additionally, we are working on the specific applications to resolve their deadlock issues.

As always, thank you for being a Rollbar customer.

Posted Feb 08, 2022 - 06:20 PST

Resolved

This incident has been resolved. Please expect a postmortem by 5 PM PT Friday.

Posted Jan 25, 2022 - 17:27 PST

Update

We have implemented additional fixes, and are monitoring the results.

Posted Jan 25, 2022 - 13:14 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 25, 2022 - 10:35 PST

Identified

We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.

Posted Jan 25, 2022 - 07:50 PST

This incident affected: Processing pipeline (Core Processing Pipeline).