One of these partitioning changes resulted in the leading shard’s active partition reaching the maximum file size for the filesystem. At 1:42 AM PDT on October 26, the partition hit that limit.
MySQL was unable to write to the partition, and our ingestion service began to buffer incoming data in Kafka.
We attempted to reorganize the leading partition (split it into multiple smaller partitions), but this process was going to take a prohibitively long time.
Instead, we determined that the fastest and safest path forward was to roll out a new shard.
By 5:10 PDT we were ingesting traffic again, and by 7:28 PDT we had fully caught up.
We provisioned a new shard and moved processing onto it. We also modified the partitioning scheme going forward to prevent this issue from recurring, and added monitoring to notify us when file sizes approach the filesystem maximum.
October 26 (All Times in PDT):