Resolved
The pipeline is fully caught up and we are now processing and alerting on data as it streams in. There was a short window (~3 minutes) at the beginning of the incident (~1 PM PDT) during which we dropped data. After that point, no data was lost, although processing was delayed. The primary cause has been identified and linked to the remediation steps we took during yesterday's outage. More details to follow.
We have scheduled a postmortem for this incident for Monday of next week and will update this incident with the postmortem notes once it is complete.
Monitoring
The processing pipeline is currently catching up. We have also made a change to prioritize new data: while the system is at or over capacity, incoming data will be processed and alerted on ahead of the backlog.
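For illustration, the sketch below shows one way such a prioritization change can work: fresh records are drained before backlog records whenever workers are saturated. All names here are hypothetical and this is not the actual pipeline code, just a minimal example of the idea.

```python
import heapq
import time

FRESH, BACKLOG = 0, 1  # lower value = higher priority

class PrioritizingQueue:
    """Hypothetical queue that serves fresh records before backlog records."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within each class

    def put(self, record, is_backlog: bool):
        priority = BACKLOG if is_backlog else FRESH
        heapq.heappush(self._heap, (priority, self._counter, record))
        self._counter += 1

    def get(self):
        if not self._heap:
            return None
        _, _, record = heapq.heappop(self._heap)
        return record

# Usage: even if backlog arrives first, the fresh record is processed first.
q = PrioritizingQueue()
q.put({"metric": "cpu", "ts": time.time() - 3600}, is_backlog=True)
q.put({"metric": "cpu", "ts": time.time()}, is_backlog=False)
assert q.get()["ts"] > time.time() - 60  # fresh record comes out first
```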
Identified
Pipeline processing latency peaked at approximately 60 minutes and is now decreasing.
Identified
We've identified an issue causing processing latency of approximately 25 minutes. We're working to resolve the issue.