At 16:48 PST, one of our infrastructure engineers noticed that our Kafka cluster had entered a corrupt state.
Because this threatened the stability of the cluster responsible for our product’s ingestion, the on-call engineers attempted to rebalance the cluster to remove the orphaned broker IDs.
However, the corruption was more extensive than expected, and rebalance operations that should have been non-blocking and allowed continued operation instead took far longer than anticipated and blocked ingestion.
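For context, the sketch below shows one way to spot orphaned broker IDs, i.e. brokers that are still referenced by partition replica assignments but no longer registered in the cluster. This is an illustration only, assuming the confluent-kafka Python client and a placeholder bootstrap address; it is not the tooling we actually used.

```python
from confluent_kafka.admin import AdminClient

# Illustrative only: the bootstrap address is a placeholder.
admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Fetch full cluster metadata: live brokers plus per-partition replica assignments.
md = admin.list_topics(timeout=10)
live_brokers = set(md.brokers)  # broker IDs currently registered in the cluster

orphaned = set()
for topic in md.topics.values():
    for partition in topic.partitions.values():
        # Any replica assigned to a broker ID that is no longer part of the
        # cluster points at an orphaned broker.
        orphaned.update(set(partition.replicas) - live_brokers)

print("orphaned broker ids:", sorted(orphaned))
```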
At this point we needed to offload some enqueued data in order to bring the cluster back online, leaving a gap in incoming data from 17:15 to 19:01 PST.
This unblocked ingestion and restored the normal flow of data into the Kafka cluster.
However, some functionality remained broken, most notably our item grouping.
By rebalancing additional topics and reconfiguring some consumer groups, we were finally able to restore core functionality at 20:08 PST.
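As a rough illustration of the kind of check that is useful while reconfiguring consumer groups, the sketch below computes per-partition consumer lag. It again assumes the confluent-kafka Python client, and the topic and group names are hypothetical.

```python
from confluent_kafka import Consumer, TopicPartition

# Hypothetical names: "ingestion-events" topic, "item-grouping" consumer group.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "item-grouping",
    "enable.auto.commit": False,
})

topic = "ingestion-events"
md = consumer.list_topics(topic, timeout=10)
partitions = [TopicPartition(topic, p) for p in md.topics[topic].partitions]

# Compare the group's committed offsets against the log end offsets.
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low  # no commit yet -> treat as start of log
    print(f"{tp.topic}[{tp.partition}] lag={high - committed}")

consumer.close()
```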
At this point the pipeline was mostly functional, with the exception of our realtime search, which was again affected by replica issues.
We addressed these and restored full functionality at 20:29 PST.
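One common form of replica issue is under-replicated partitions, where the in-sync replica set has shrunk below the assigned replica set. A minimal sketch of that check, assuming the confluent-kafka client and a hypothetical topic name:

```python
from confluent_kafka.admin import AdminClient

def under_replicated_partitions(admin: AdminClient, topic: str) -> list[int]:
    """Return partition IDs whose in-sync replica set is smaller than the assigned replica set."""
    md = admin.list_topics(topic, timeout=10)
    return [
        pid
        for pid, p in md.topics[topic].partitions.items()
        if len(p.isrs) < len(p.replicas)
    ]

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
print(under_replicated_partitions(admin, "realtime-search"))  # hypothetical topic name
```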
Apache Kafka was introduced into our pipeline architecture earlier this year and is a critical piece of the future of our backend. It is a core part of our path to scaling our product while maintaining our realtime functionality. That being said, it has become critical to our system in a very short period of time. This outage helped expose a list of improvements we can make across different components to ensure something similar doesn’t happen again.
Some of our immediate action items are:
Add additional metrics/monitoring for cluster configuration and cluster management tasks.
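As a sketch of what that monitoring might look like: a periodic task that collects a few coarse cluster-health gauges. The metric names, interval, and the confluent-kafka client are assumptions here, and wiring the values into an actual metrics backend is omitted.

```python
import time
from confluent_kafka.admin import AdminClient

def cluster_health(admin: AdminClient) -> dict:
    """Collect a few coarse cluster-health gauges from cluster metadata."""
    md = admin.list_topics(timeout=10)
    under_replicated = sum(
        1
        for topic in md.topics.values()
        for p in topic.partitions.values()
        if len(p.isrs) < len(p.replicas)
    )
    return {
        "kafka.brokers.online": len(md.brokers),                 # assumed metric name
        "kafka.partitions.under_replicated": under_replicated,   # assumed metric name
    }

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
while True:
    for name, value in cluster_health(admin).items():
        print(name, value)  # replace with a push to the real metrics backend
    time.sleep(60)
```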
November 18, 2020 (All Times in PST):
16:48 - Cluster was determined to be in a corrupt state; attempts to remedy this began.
17:15 - Ingestion halts, pipeline stalls.
19:01 - Ingestion resumes, pipeline is still stalled.
20:08 - Pipeline resumes, realtime search is still broken.
20:29 - All functionality is restored.