On August 22, 2017, Rollbar experienced two separate outages. This doc describes the timeline and impact of each outage, the root causes we have identified, and the steps we have taken and plan to take to address those causes.
(All times are PDT)
Outage 1
- 9:27am - primary server on the raw database cluster crashes. Processing latency rises above normal levels.
- 9:27am through 9:47am - the same server crashes several more times. Processing latency continues to rise.
- 9:48am - Rollbar engineers implement a first change to reduce load on the system, mitigating the issue. Crashes stop. Processing pipeline is catching up.
- 10:12am - Rollbar engineers implement a second change to further reduce load on the system. As a side effect, the API tier experiences ~2 minutes of intermittent downtime (returning 502s).
- 10:29am - Processing pipeline is fully caught up and systems are stable.
Impact
- From approx 10:12am to 10:14am, the API tier dropped a high percentage of requests.
- From approx 9:27am to 10:29am, processing pipeline latency was higher than normal (as high as ~15 minutes for much of the period).
Outage 2
- 12:59pm - web tier experiences an elevated rate of slow responses on some pages.
- 1:18pm - load on the fingerprinting system rises to high levels.
- 1:20pm - processing latency rises above normal levels.
- 1:42pm - Rollbar engineers mitigate the high load; latency begins to drop.
- 1:54pm - primary server on the main database cluster crashes. Pipeline is stalled, API tier responds with 502s, and web tier is down.
- 1:57pm - crashed server recovers; systems recover.
- 2:30pm - processing latency back to normal levels.
Impact
- From approx 1:54pm to 1:57pm, the API tier dropped a high percentage of requests.
- From approx 1:20pm to 2:30pm, processing pipeline latency was higher than normal (as high as 9 minutes).
Causes We've Identified and What We're Doing
- The 9:48am fix (disabling an expensive analytics service) was the second time we'd applied this fix; the first time, we didn't make it permanent. We've since ensured that it is.
- The 10:12am fix (adding more memcache capacity) caused a brief outage on the API tier. We are investigating how to remove this as a SPOF on the API tier.
- The 1:18pm issue (a high number of unique items) required manual intervention to resolve. We are looking into how to automate the resolution steps.
- When the primary database server on the main cluster crashed, the API tier was affected until it came back online. We are investigating how to remove this as a SPOF on the API tier.
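Two of the items above involve removing backing services (the memcache tier and the primary database) as single points of failure for the API tier. One common pattern for the cache side, sketched below with hypothetical names (`CacheUnavailable`, `get_item`, and the stand-in cache classes are ours for illustration, not Rollbar's actual code), is to treat the cache as a best-effort optimization: a cache outage degrades to slower database reads instead of 502s.

```python
# Sketch: treating the cache tier as an optimization, not a hard dependency.
# All names here are hypothetical stand-ins; the point is that a cache
# failure degrades to a slower response rather than a dropped request.

class CacheUnavailable(Exception):
    """Raised when the cache tier cannot be reached."""

def get_item(cache, db_loader, key):
    """Return the value for key, falling back to the database on cache failure."""
    try:
        value = cache.get(key)
        if value is not None:
            return value
    except CacheUnavailable:
        # Cache is down (e.g. during a capacity change): treat it as a miss.
        pass
    value = db_loader(key)
    try:
        cache.set(key, value)
    except CacheUnavailable:
        pass  # Best effort; serving the request matters more than caching it.
    return value

class DictCache:
    """Healthy in-memory cache used for illustration."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

class DownCache:
    """A cache whose every operation fails, simulating an outage."""
    def get(self, key):
        raise CacheUnavailable
    def set(self, key, value):
        raise CacheUnavailable
```

With this shape, adding or removing cache capacity can only slow requests down, never fail them outright; a similar read-replica fallback can soften a database primary crash.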
We know how important Rollbar's reliability is to your teams' ability to ship and operate your software, and we're sorry for the disruption these outages caused. Know that we take these issues seriously and are working deliberately to improve our systems.