Write-up published

Resolved

What Happened
  • Several months ago we made some changes to our partitioning model for our raw item storage. These changes enabled us to migrate onto much larger disk sizes, reducing our operational overhead and improving our stability.

  • One of these partitioning changes resulted in the leading shard’s active partition reaching the maximum file size for the filesystem.

  • At 1:41 AM PDT on 10/26, the partition hit the maximum file size for the filesystem.

  • MySQL was unable to write to the partition, and our ingestion service began to buffer incoming data in Kafka (see the first sketch after this list).

  • We attempted to reorganize the leading partition (splitting it into multiple smaller partitions; see the second sketch after this list), but this would have taken prohibitively long.

  • Instead, we determined that the fastest and safest path forward was to roll out a new shard.

  • By 5:10 AM PDT we were ingesting traffic again, and by 7:28 AM PDT we were fully caught up.
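
For illustration, here is a minimal sketch of the buffering pattern described above, assuming a kafka-python consumer. The topic, broker, consumer group, and the write_to_mysql() stand-in are hypothetical, not our actual ingestion service.

```python
# Minimal sketch of "buffer in Kafka while the database cannot accept writes".
# Topic, broker, group, and the write_to_mysql() stand-in are hypothetical.
import time

from kafka import KafkaConsumer  # kafka-python


def write_to_mysql(record: bytes) -> None:
    """Stand-in for the real sink; fails while the partition cannot grow."""
    ...


consumer = KafkaConsumer(
    "raw-items",                     # hypothetical topic
    bootstrap_servers="kafka:9092",  # hypothetical broker
    group_id="ingestion",
    enable_auto_commit=False,        # offsets advance only after durable writes
)

for msg in consumer:
    while True:
        try:
            write_to_mysql(msg.value)
            consumer.commit()        # safe to advance past this message now
            break
        except IOError:
            # Sink is down: retry and stop pulling new messages. Unconsumed
            # data simply accumulates in Kafka, which is the buffering
            # behavior that prevented data loss during this incident.
            time.sleep(30)
```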
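
Likewise, a hedged sketch of the kind of online partition split we attempted. The table name, partition names, range boundaries, and connection details are invented for illustration; REORGANIZE PARTITION rewrites every row of the source partition, which is why its runtime scales with partition size.

```python
# Hypothetical sketch of the attempted split; table, partition names, and
# boundaries are made up. Requires mysql-connector-python.
import mysql.connector

# MySQL rebuilds p_active row by row into the new partitions, so the
# runtime scales with the partition's size -- prohibitive at fs-max size.
SPLIT_SQL = """
ALTER TABLE raw_items
REORGANIZE PARTITION p_active INTO (
    PARTITION p_active_a VALUES LESS THAN (2000000000),
    PARTITION p_active_b VALUES LESS THAN (4000000000),
    PARTITION p_active   VALUES LESS THAN MAXVALUE
)
"""

cnx = mysql.connector.connect(
    host="db-lead-shard.internal",   # hypothetical host
    user="ops",
    password="redacted",
    database="items",
)
try:
    cnx.cursor().execute(SPLIT_SQL)  # DDL; MySQL commits it implicitly
finally:
    cnx.close()
```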

Impact
  • No data was lost, and there was no downtime for the API or Web App.

  • The processing pipeline was fully halted from 1:41 AM PDT to 5:08 AM PDT.

Resolution

We provisioned a new shard and moved processing onto it. We also modified the partitioning scheme going forward so that this issue cannot recur, and added monitoring to alert us before any partition file approaches the filesystem’s maximum file size.
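
As an illustration of that safeguard, here is a minimal file-size watchdog, assuming InnoDB’s file-per-table layout. The data directory, the 16 TiB limit, and the 80% threshold are assumptions, not our production values.

```python
# Minimal sketch of a file-size watchdog. The data directory, the
# filesystem's max file size, and the alert mechanism are assumptions.
from pathlib import Path

DATA_DIR = Path("/var/lib/mysql/items")  # hypothetical MySQL datadir
FS_MAX_FILE_SIZE = 2**44                 # assumed filesystem limit (16 TiB)
ALERT_THRESHOLD = 0.8                    # page us well before the hard limit


def oversized_partition_files() -> list[tuple[Path, int]]:
    """Return per-partition .ibd files above the alert threshold."""
    hits = []
    for ibd in DATA_DIR.glob("*.ibd"):
        size = ibd.stat().st_size
        if size >= ALERT_THRESHOLD * FS_MAX_FILE_SIZE:
            hits.append((ibd, size))
    return hits


if __name__ == "__main__":
    for path, size in oversized_partition_files():
        # In production this would page on-call; printing stands in here.
        print(f"ALERT: {path} is at {size / FS_MAX_FILE_SIZE:.0%} of the fs max")
```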

Timeline

October 26 (All Times in PDT):

  • 1:41 AM: Leading partition hits the filesystem max and ingestion halts.

  • 3:17 AM: After unsuccessful attempts to reorganize the partition on the fly, it is determined that the best path forward is to provision a new shard.

  • 5:08 AM: New shard is in service and pipeline is processing again.

  • 7:28 AM: Pipeline is fully recovered and all components are operational again.

Mon, Nov 2, 2020, 09:18 PM

Resolved

This incident has been resolved. Incoming events are processed and notifications are sent in real time, prioritized ahead of the backlog of events that accumulated during the incident.

Mon, Oct 26, 2020, 12:14 PM

Monitoring

A fix has been implemented and we are monitoring the results.

Mon, Oct 26, 2020, 11:46 AM

Identified

We are continuing to work on a fix for this issue. We estimate the fix to take effect in about 20 minutes; after that, the processing pipeline will be operational again.

Mon, Oct 26, 2020, 11:33 AM

Identified

We are continuing to work on a fix for this issue. We estimate the fix to take effect in about 15 minutes.

Mon, Oct 26, 2020, 11:10 AM

Identified

We are continuing to work on a fix for this issue. We estimate the fix to take effect in about 30 minutes.

Mon, Oct 26, 2020, 10:37 AM

Identified

We are continuing to work on a fix for this issue. We estimate the fix to take effect in about one hour; until then, new events will not show up in the item list and notifications will not be triggered.

Mon, Oct 26, 2020, 10:17 AM

Identified

The issue has been identified and a fix is being implemented.

Mon, Oct 26, 2020, 09:36 AM

Investigating

We are currently investigating this issue.

Mon, Oct 26, 2020, 09:27 AM