Pipeline Delay due to dependency outage
Incident Report for Rollbar
Postmortem

2020-11-18 Kafka Outage

What Happened

  • At 16:48 PST, one of our infrastructure engineers noticed that our Kafka cluster had entered a corrupt state.

    • One of our Kafka hosts had joined the cluster as a broker twice, under two different ids.
  • Because this was a threat to the stability of the cluster, which is responsible for our product’s ingestion, the on-call engineers attempted to rebalance the cluster to remove the orphaned broker ids.

  • However, the corruption was more extensive than expected, and rebalancing operations that should have been non-blocking and allowed continued operation instead took an extremely long time and blocked ingestion.

    • Publishes to certain key topics succeeded, but attempts to consume from them failed because a hung coordinator was stuck rebalancing their partitions (an illustrative probe for this failure mode follows this list).
    • Our ingestion began to break, along with other components inside our pipeline.
  • At this point we needed to offload some enqueued data in order to bring the cluster back online, leaving a gap in incoming data from 17:15 - 19:01 PST.

  • This unblocked ingestion and got data coming into the Kafka cluster normally.

  • However, certain things remained broken, namely our item grouping functionality.

  • By rebalancing additional topics and reconfiguring some consumer groups, we were able to restore core functionality at 20:08 PST.

  • At this point the pipeline was mostly functioning, except for our realtime search, again due to replica issues.

  • We addressed these and got everything functioning at 20:29 PST.
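
For illustration, the consume-path failure described above can be detected with a simple end-to-end probe that publishes a marker message and then tries to read it back. This is a minimal sketch using the confluent-kafka Python client; the broker address, topic name, group naming, and timeouts are assumptions for illustration, not our production values.

```python
# Minimal end-to-end probe, not production code: publish a marker message,
# then try to read it back with a fresh consumer group within a deadline.
# Broker address, topic, and timeouts are placeholders (assumptions).
import time
import uuid

from confluent_kafka import Producer, Consumer

BOOTSTRAP = "kafka.internal:9092"   # placeholder broker address
TOPIC = "ingestion-probe"           # placeholder probe topic


def consume_path_healthy(timeout_s: float = 10.0) -> bool:
    marker = uuid.uuid4().bytes

    # Publish side: during the incident this part still succeeded.
    producer = Producer({"bootstrap.servers": BOOTSTRAP})
    producer.produce(TOPIC, value=marker)
    producer.flush(timeout_s)

    # Consume side: a fresh group forces the coordinator to assign partitions,
    # which is exactly the step that hung during the incident.
    consumer = Consumer({
        "bootstrap.servers": BOOTSTRAP,
        "group.id": f"probe-{uuid.uuid4()}",
        "auto.offset.reset": "earliest",  # assumes the probe topic holds little data
    })
    consumer.subscribe([TOPIC])
    deadline = time.monotonic() + timeout_s
    try:
        while time.monotonic() < deadline:
            msg = consumer.poll(1.0)
            if msg is not None and not msg.error() and msg.value() == marker:
                return True
        return False
    finally:
        consumer.close()


if __name__ == "__main__":
    print("consume path OK" if consume_path_healthy() else "consume path STALLED")
```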

Impact

  • Incoming data was lost during the initial rebalancing process (17:15 - 19:01 PST).
  • The pipeline remained down for nearly 3 hours.
  • Realtime search was down for an additional 21 minutes.

Resolution

Apache Kafka was introduced into our pipeline architecture earlier this year and is a critical piece of the future of our backend. It is a core part of our path to scaling our product while maintaining our realtime functionality. That being said, it has become critical to our system in a very short period of time. This outage exposed a list of improvements we can make to different components to ensure something similar doesn’t happen again.

Some of our immediate action items are:

  • Transition to a much larger number of smaller Kafka brokers to distribute the data more widely and make the cluster more fault tolerant. Additionally, this will reduce the incremental overhead of rebalancing actions for individual broker outages.
  • Harden certain Kafka consumers to be more fault tolerant.
  • Transition Kafka consumer/publisher configuration to our Consul cluster, allowing us to dynamically modify the values in response to issues (a sketch combining this and the previous item follows this list).
  • Add additional metrics/monitoring for cluster configuration and cluster management tasks.

    • Detect edge cases like the one in this incident early (see the broker-registration check sketched after this list).
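
As a rough sketch of the consumer-hardening and Consul items, the example below loads consumer overrides from a Consul KV key at startup and keeps polling through transient errors instead of crashing. It uses the confluent-kafka Python client and Consul's KV HTTP API; the Consul key, config keys and defaults, broker address, topic, and group id are assumptions for illustration.

```python
# Sketch only: consumer settings pulled from a Consul KV key at startup, plus a
# poll loop that logs and continues on transient errors instead of crashing.
# The Consul key, config defaults, broker address, topic, and group id are
# assumptions for illustration, not our real configuration.
import json
import logging
import time

import requests
from confluent_kafka import Consumer

CONSUL_KEY_URL = "http://localhost:8500/v1/kv/pipeline/consumer-config?raw"  # hypothetical key


def load_overrides() -> dict:
    """Fetch consumer overrides from Consul; fall back to defaults on any failure."""
    defaults = {"session.timeout.ms": 45000, "max.poll.interval.ms": 300000}
    try:
        resp = requests.get(CONSUL_KEY_URL, timeout=2)
        resp.raise_for_status()
        defaults.update(json.loads(resp.text))
    except Exception:
        logging.warning("Consul unavailable; using built-in defaults")
    return defaults


def run(topic: str = "occurrences") -> None:  # topic name is illustrative
    consumer = Consumer({
        "bootstrap.servers": "kafka.internal:9092",  # placeholder
        "group.id": "pipeline-worker",               # placeholder
        **load_overrides(),
    })
    consumer.subscribe([topic])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue
            if msg.error():
                # Tolerate broker/rebalance errors: log, back off, keep polling.
                logging.error("consume error: %s", msg.error())
                time.sleep(1.0)
                continue
            handle(msg.value())
    finally:
        consumer.close()


def handle(payload: bytes) -> None:
    """Placeholder for the real pipeline handler."""
```

Re-reading the key on a timer, or via a Consul watch, would make these values adjustable mid-incident, which is the point of moving the configuration into Consul.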
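
For the monitoring item, a check along these lines would have flagged the duplicate broker registration behind this incident: group the brokers reported in cluster metadata by host and alert when a host appears under more than one id. This is a sketch, and the bootstrap address is a placeholder.

```python
# Sketch of a monitoring check for the edge case behind this incident: a host
# registered in the cluster under more than one broker id. The bootstrap
# address is a placeholder (assumption).
from collections import defaultdict

from confluent_kafka.admin import AdminClient


def duplicate_broker_hosts(bootstrap: str = "kafka.internal:9092") -> dict:
    """Return {host: [broker ids]} for hosts registered under more than one id."""
    metadata = AdminClient({"bootstrap.servers": bootstrap}).list_topics(timeout=10)
    ids_by_host = defaultdict(list)
    for broker_id, broker in metadata.brokers.items():
        ids_by_host[broker.host].append(broker_id)
    return {host: ids for host, ids in ids_by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    dupes = duplicate_broker_hosts()
    if dupes:
        # In production this would feed an alert rather than print.
        print(f"duplicate broker registrations: {dupes}")
```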

Timeline

November 18 2020 (All Times in PST):

16:48 - The cluster was determined to be in a corrupt state; attempts to remedy this began.
17:15 - Ingestion halts, pipeline stalls.
19:01 - Ingestion resumes, pipeline is still stalled.
20:08 - Pipeline resumes, realtime search is still broken.
20:29 - All functionality is restored.

Posted Nov 23, 2020 - 10:57 PST

Resolved
All components functional again.
Posted Nov 18, 2020 - 20:29 PST
Update
We have experienced a regression, and are continuing to work on the internal queuing service.
Posted Nov 18, 2020 - 20:08 PST
Update
The pipeline is normalized and functioning; however, we are still having issues with our realtime search. We will update when this is resolved.
Posted Nov 18, 2020 - 19:29 PST
Monitoring
We have implemented a fix and expect the pipeline to normalize shortly.
Posted Nov 18, 2020 - 19:16 PST
Update
We have identified the source of the issue and are working towards a fix.
Posted Nov 18, 2020 - 18:06 PST
Identified
The pipeline is currently stalled due to an outage with an internal queuing service. We are working to bring it back online.
Posted Nov 18, 2020 - 17:20 PST
This incident affected: Processing pipeline (Core Processing Pipeline).