Write-up published
Resolved
Beginning at 4:10 pm PT on 9/30/21, Rollbar's processing pipeline stalled and our marketing site was partially unavailable. The stall did not affect our ingestion service, but it did delay processing by at most 3 hours and 14 minutes. Because our pipeline processes occurrences in LIFO order, the delay was likely perceived as shorter than that.
During this delay, there was potential data loss due to API tier unavailability, as well as for errors that had already been retried the maximum number of times. The total window for this potential data loss was between 4:10 pm and 5:52 pm PT.
The partial outage of the marketing site and processing pipeline was caused by one of our non-critical database tables reaching the maximum file size for our filesystem (16 TB for ext4). Once the table hit that limit we could no longer insert rows into it, which halted our processing pipeline. The halt in turn increased load on our API tier, because the pipeline reports its own internal errors to api.rollbar.com, and that extra load slowed API response times. Finally, the slow API responses left our Web tier unable to serve our marketing site.
The database table in question is primarily used to check for duplicate occurrences, and errors from that check are reported to api.rollbar.com. Since the check is performed in our processing pipeline, we ran into an internal loop while processing and reporting errors. This type of behavior is rare thanks to various safeguards (e.g. limited retries and exponential backoff), but it does happen from time to time.
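As a rough illustration of the kind of safeguard mentioned above, the sketch below shows a bounded retry with exponential backoff and jitter; the function names, retry limit, and delays are assumptions for illustration, not our production code.

```python
import random
import time

MAX_RETRIES = 5           # assumed cap on attempts; illustrative only
BASE_DELAY_SECONDS = 1.0  # assumed starting delay


def report_with_backoff(send_report):
    """Call send_report(), retrying a bounded number of times.

    Each retry waits exponentially longer (plus jitter), so a failing
    endpoint is not hammered in a tight loop.
    """
    for attempt in range(MAX_RETRIES):
        try:
            return send_report()
        except Exception:
            if attempt == MAX_RETRIES - 1:
                raise  # give up after the final attempt
            delay = BASE_DELAY_SECONDS * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))
```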
To guarantee that this will not happen again, we will be implementing the following:
Monitoring to alert our team of DB files approaching maximum filesystem limits (a rough sketch of such a check appears after this list)
Partitioning for very large DB tables (which will make it easier to perform maintenance such as partition deletion)
Decoupling our marketing site from our web application
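As a minimal sketch of the first item above, a check like the following could flag database data files that are approaching the ext4 file-size limit. The data directory, the .ibd file pattern, and the 90% alert threshold are assumptions for illustration, not our actual monitoring configuration.

```python
import glob
import os

# ext4's maximum file size (the roughly 16 TB limit mentioned above).
EXT4_MAX_FILE_BYTES = 16 * 1024 ** 4
ALERT_THRESHOLD = 0.90        # assumed alert threshold; illustrative only
DATA_DIR = "/var/lib/mysql"   # hypothetical data directory


def files_near_limit(data_dir=DATA_DIR, threshold=ALERT_THRESHOLD):
    """Return (path, size) pairs for data files above the alert threshold."""
    flagged = []
    for path in glob.glob(os.path.join(data_dir, "**", "*.ibd"), recursive=True):
        size = os.path.getsize(path)
        if size >= threshold * EXT4_MAX_FILE_BYTES:
            flagged.append((path, size))
    return flagged


if __name__ == "__main__":
    for path, size in files_near_limit():
        print(f"WARNING: {path} is at {size / EXT4_MAX_FILE_BYTES:.0%} of the ext4 file-size limit")
```

In practice a check like this would feed an alerting system rather than print to stdout, but the threshold comparison is the important part.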
Resolved
Processing has fully caught up and this incident is resolved. We will write up a postmortem covering this incident and yesterday's, and publish it to this status page early next week.
Monitoring
Processing delays are almost resolved and we will update once the pipeline is caught up.
Monitoring
We've implemented a fix for the DB issue and we are now monitoring the backlog while processing catches back up.
Investigating
We have identified the issue with one of our databases and are working to remediate it. We are also experiencing intermittent website outages for the marketing site.
Investigating
We are continuing to investigate this issue.
Investigating
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.