Pipeline Delay due to dependency outage
Incident Report for Rollbar
Postmortem

2020-11-18 Kafka Outage

What Happened

  • At 16:48 PST, one of our infrastructure engineers noticed that our Kafka cluster had entered a corrupt state.

    • One of our Kafka hosts had joined the cluster as a broker twice, under two different ids.
  • Because this was a threat to the stability of the cluster, which is responsible for our product’s ingestion, the on-call engineers attempted to rebalance the cluster to remove the orphaned broker ids.

  • However, the corruption was more extensive than expected, and rebalancing operations that should have been non-blocking and allowed continued operation instead took an extremely long time and blocked ingestion.

    • Publishes to certain key topics succeeded, but attempts to consume from them failed because a hung coordinator was stuck rebalancing their partitions (an illustrative probe for this failure mode follows this list).
    • Our ingestion began to break, along with other components inside our pipeline.
  • At this point we needed to offload some enqueued data in order to bring the cluster back online, leaving a gap in incoming data from 17:15 - 19:01 PST.

  • This unblocked ingestion and got data coming into the Kafka cluster normally.

  • However, certain things remained broken, namely our item grouping functionality.

  • By rebalancing additional topics and reconfiguring some consumer groups, we were able to restore core functionality at 20:08 PST.

  • At this point the pipeline was mostly functioning, except for our realtime search, again due to replica issues.

  • We addressed these and got everything functioning at 20:29 PST.
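
For illustration, the consume-path failure described above can be detected with a simple end-to-end probe that publishes a marker message and then tries to read it back. This is a minimal sketch using the confluent-kafka Python client; the broker address, topic name, group naming, and timeouts are assumptions for illustration, not our production values.

```python
# Minimal end-to-end probe, not production code: publish a marker message,
# then try to read it back with a fresh consumer group within a deadline.
# Broker address, topic, and timeouts are placeholders (assumptions).
import time
import uuid

from confluent_kafka import Producer, Consumer

BOOTSTRAP = "kafka.internal:9092"   # placeholder broker address
TOPIC = "ingestion-probe"           # placeholder probe topic


def consume_path_healthy(timeout_s: float = 10.0) -> bool:
    marker = uuid.uuid4().bytes

    # Publish side: during the incident this part still succeeded.
    producer = Producer({"bootstrap.servers": BOOTSTRAP})
    producer.produce(TOPIC, value=marker)
    producer.flush(timeout_s)

    # Consume side: a fresh group forces the coordinator to assign partitions,
    # which is exactly the step that hung during the incident.
    consumer = Consumer({
        "bootstrap.servers": BOOTSTRAP,
        "group.id": f"probe-{uuid.uuid4()}",
        "auto.offset.reset": "earliest",  # assumes the probe topic holds little data
    })
    consumer.subscribe([TOPIC])
    deadline = time.monotonic() + timeout_s
    try:
        while time.monotonic() < deadline:
            msg = consumer.poll(1.0)
            if msg is not None and not msg.error() and msg.value() == marker:
                return True
        return False
    finally:
        consumer.close()


if __name__ == "__main__":
    print("consume path OK" if consume_path_healthy() else "consume path STALLED")
```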

Impact

  • Incoming data was lost during the initial rebalancing process (17:15 - 19:01 PST).
  • The pipeline remained down for nearly 3 hours.
  • Realtime search was down for an additional 21 minutes.

Resolution

Apache Kafka was introduced into our pipeline architecture earlier this year and is a critical piece of the future of our backend. It is a core part of our path to scaling our product while maintaining our realtime functionality. That being said, it has become critical to our system in a very short period of time. This outage exposed a list of improvements we can make to different components to ensure something similar doesn’t happen again.

Some of our immediate action items are:

  • Transition to a much larger number of smaller Kafka brokers to distribute the data more widely and make the cluster more fault tolerant. Additionally, this will reduce the incremental overhead of rebalancing actions for individual broker outages.
  • Harden certain Kafka consumers to be more fault tolerant.
  • Transition Kafka consumer/publisher configuration to our Consul cluster, allowing us to dynamically modify the values in response to issues (a sketch combining this and the previous item follows this list).
  • Add additional metrics/monitoring for cluster configuration and cluster management tasks.

    • Detect edge cases like the one in this incident early (see the broker-registration check sketched after this list).
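
As a rough sketch of the consumer-hardening and Consul items, the example below loads consumer overrides from a Consul KV key at startup and keeps polling through transient errors instead of crashing. It uses the confluent-kafka Python client and Consul's KV HTTP API; the Consul key, config keys and defaults, broker address, topic, and group id are assumptions for illustration.

```python
# Sketch only: consumer settings pulled from a Consul KV key at startup, plus a
# poll loop that logs and continues on transient errors instead of crashing.
# The Consul key, config defaults, broker address, topic, and group id are
# assumptions for illustration, not our real configuration.
import json
import logging
import time

import requests
from confluent_kafka import Consumer

CONSUL_KEY_URL = "http://localhost:8500/v1/kv/pipeline/consumer-config?raw"  # hypothetical key


def load_overrides() -> dict:
    """Fetch consumer overrides from Consul; fall back to defaults on any failure."""
    defaults = {"session.timeout.ms": 45000, "max.poll.interval.ms": 300000}
    try:
        resp = requests.get(CONSUL_KEY_URL, timeout=2)
        resp.raise_for_status()
        defaults.update(json.loads(resp.text))
    except Exception:
        logging.warning("Consul unavailable; using built-in defaults")
    return defaults


def run(topic: str = "occurrences") -> None:  # topic name is illustrative
    consumer = Consumer({
        "bootstrap.servers": "kafka.internal:9092",  # placeholder
        "group.id": "pipeline-worker",               # placeholder
        **load_overrides(),
    })
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue
            if msg.error():
                # Tolerate broker/rebalance errors: log, back off, keep polling.
                logging.error("consume error: %s", msg.error())
                time.sleep(1.0)
                continue
            handle(msg.value())
    finally:
        consumer.close()


def handle(payload: bytes) -> None:
    """Placeholder for the real pipeline handler."""
```

Re-reading the key on a timer, or via a Consul watch, would make these values adjustable mid-incident, which is the point of moving the configuration into Consul.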
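
For the monitoring item, a check along these lines would have flagged the duplicate broker registration behind this incident: group the brokers reported in cluster metadata by host and alert when a host appears under more than one id. This is a sketch, and the bootstrap address is a placeholder.

```python
# Sketch of a monitoring check for the edge case behind this incident: a host
# registered in the cluster under more than one broker id. The bootstrap
# address is a placeholder (assumption).
from collections import defaultdict

from confluent_kafka.admin import AdminClient


def duplicate_broker_hosts(bootstrap: str = "kafka.internal:9092") -> dict:
    """Return {host: [broker ids]} for hosts registered under more than one id."""
    metadata = AdminClient({"bootstrap.servers": bootstrap}).list_topics(timeout=10)
    ids_by_host = defaultdict(list)
    for broker_id, broker in metadata.brokers.items():
        ids_by_host[broker.host].append(broker_id)
    return {host: ids for host, ids in ids_by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    dupes = duplicate_broker_hosts()
    if dupes:
        # In production this would feed an alert rather than print.
        print(f"duplicate broker registrations: {dupes}")
```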

Timeline

November 18 2020 (All Times in PST):

16:48 - The cluster was determined to be in a corrupt state; attempts to remedy this began.
17:15 - Ingestion halts, pipeline stalls.
19:01 - Ingestion resumes, pipeline is still stalled.
20:08 - Pipeline resumes, realtime search is still broken.
20:29 - All functionality is restored.

Posted Nov 23, 2020 - 10:57 PST

Resolved
All components functional again.
Posted Nov 18, 2020 - 20:29 PST
Update
We have experienced a regression, and are continuing to work on the internal queuing service.
Posted Nov 18, 2020 - 20:08 PST
Update
The pipeline is normalized and functioning; however, we are still having issues with our realtime search. We will update when this is resolved.
Posted Nov 18, 2020 - 19:29 PST
Monitoring
We have implemented a fix and expect the pipeline to normalize shortly.
Posted Nov 18, 2020 - 19:16 PST
Update
We have identified the source of the issue and are working towards a fix.
Posted Nov 18, 2020 - 18:06 PST
Identified
The pipeline is currently stalled due to an outage with an internal queuing service. We are working to bring it back online.
Posted Nov 18, 2020 - 17:20 PST
This incident affected: Processing pipeline (Core Processing Pipeline).