Web Outage
Incident Report for Rollbar
Postmortem

Summary of the Incident and Impact

On February 3rd, 2024, between 03:37 and 06:35 PST, Rollbar experienced a platform outage affecting the Web Application (rollbar.com) and Platform API (api.rollbar.com) servers. The outage can be traced to an automated upgrade, performed by Google Cloud Platform, of Rollbar’s GKE (Google Kubernetes Engine) clusters.

Following this incident, the trailing-12-month uptime of the API tier as measured by our external monitoring service is 99.92%.

The upgrade removed firewall rules that allow health checks from Google Cloud Application Load Balancers (ALBs) to reach our application servers. The ALBs require these health checks to succeed before they will send traffic to the servers. Our default network firewall posture is very strict: we disallow all IP traffic on the relevant ports, so removing a rule has significant consequences. With the rules gone, the ALB health checks could no longer reach the workloads on the GKE clusters, and the load balancers registered all workloads as unhealthy.
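
The failure mode described above can be sketched as a small check that reports which health-check source ranges are no longer allowed on a serving port. This is a minimal illustration, not Rollbar's tooling: the rule dictionaries, the `uncovered_ranges` helper, and port 8080 are hypothetical, and the two CIDR ranges are the health-check probe ranges Google documents for its load balancers at the time of writing, so they should be verified against current documentation before use.

```python
import ipaddress

# Assumption: probe source ranges Google documents for load balancer health checks.
HEALTH_CHECK_RANGES = ["35.191.0.0/16", "130.211.0.0/22"]


def uncovered_ranges(rules, port):
    """Return the health-check ranges that no ingress allow rule covers on `port`.

    `rules` is a hypothetical, simplified representation: a list of dicts with
    "ports" (list of ints) and "source_ranges" (list of CIDR strings).
    """
    missing = []
    for cidr in HEALTH_CHECK_RANGES:
        probe_net = ipaddress.ip_network(cidr)
        covered = False
        for rule in rules:
            if port not in rule["ports"]:
                continue
            for src in rule["source_ranges"]:
                src_net = ipaddress.ip_network(src)
                # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
                if src_net.version == probe_net.version and probe_net.subnet_of(src_net):
                    covered = True
        if not covered:
            missing.append(cidr)
    return missing


# Hypothetical state after the upgrade: the automatically created rule is gone.
rules_after_upgrade = []

# Hypothetical explicitly managed rule that restores health-check access.
restored_rules = [
    {"ports": [8080], "source_ranges": ["35.191.0.0/16", "130.211.0.0/22"]},
]

print(uncovered_ranges(rules_after_upgrade, 8080))
print(uncovered_ranges(restored_rules, 8080))
```

With a default-deny posture, any non-empty result means the load balancer's probes cannot reach the workloads on that port, which matches the all-backends-unhealthy symptom seen in this incident.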

Initially, it was unclear what had happened, as Rollbar had deployed no code changes and had made no direct changes to any infrastructure. Not knowing that the firewall rules had been removed, we attempted to restart applications and create new load balancers from roughly 03:37 to 05:08. At 05:08, a support ticket was created with Rollbar’s cloud services provider and Google to help resolve the issue.

At 05:11, engineers from the cloud services provider, Google, and Rollbar joined a teleconference to work through the issue. After 75 minutes on the support call, the cloud services provider and Google were able to determine that the firewall rules had been removed by the GKE upgrade. Starting at 06:28, Rollbar created new firewall rules, which resolved the load balancer health issues and restored service for the Platform API and Web Application. By 06:35, all services were fully restored.

Timeline:

  • Feb 3 03:37 PST - Both the Platform API and Web Application stop responding
  • 03:37-05:08 PST - Attempts to remedy through restarts and creating new load balancers fail
  • 05:08 PST - Critical support ticket created with our cloud services provider
  • 05:11 PST - Teleconference call initiated with cloud services provider, Google, & Rollbar engineers
  • 06:28 PST - New firewall rules recommended and added for the Web Application’s ALB
  • 06:30 PST - Web Application became available
  • 06:32 PST - New firewall rules recommended and added for the Platform API’s ALB
  • 06:35 PST - Platform API became available
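
The intervals implied by the timeline can be worked out with a short calculation. This is only an arithmetic sketch: the timestamps are copied from the timeline above, while the event names and the `minutes_between` helper are hypothetical labels introduced for illustration.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on Feb 3, 2024, PST).
EVENTS = {
    "outage_start": "03:37",
    "ticket_created": "05:08",
    "call_started": "05:11",
    "first_rule_added": "06:28",
    "web_restored": "06:30",
    "api_restored": "06:35",
}


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


durations = {
    "self_diagnosis_before_ticket": minutes_between(EVENTS["outage_start"], EVENTS["ticket_created"]),
    "call_start_to_first_rule": minutes_between(EVENTS["call_started"], EVENTS["first_rule_added"]),
    "web_application_outage": minutes_between(EVENTS["outage_start"], EVENTS["web_restored"]),
    "platform_api_outage": minutes_between(EVENTS["outage_start"], EVENTS["api_restored"]),
}

for name, minutes in durations.items():
    print(f"{name}: {minutes} min")
```

On these figures, roughly half of the 178-minute Platform API outage (91 minutes) passed before the support ticket was opened.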

Follow-up Actions

To mitigate future risks and avoid similar incidents, we have undertaken the following actions:

  • To avoid the deletion of necessary firewall rules, we have created our own firewall rules rather than relying on automatically created ones.
  • We have added GKE update notifications to our internal application performance graphs, so that the timing of an update is visible when diagnosing future issues.
Posted Mar 27, 2024 - 20:18 PDT

Resolved
We are continuing to monitor for any further issues.
Posted Feb 03, 2024 - 06:48 PST
Update
We are continuing to monitor for any further issues.
Posted Feb 03, 2024 - 06:37 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 03, 2024 - 06:36 PST
Update
We are continuing to work on a fix for this issue.
Posted Feb 03, 2024 - 06:35 PST
Update
We are continuing to work on a fix for this issue.
Posted Feb 03, 2024 - 06:33 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 03, 2024 - 06:27 PST
Investigating
We are experiencing a processing delay in our occurrence pipeline. We are investigating actively and will post when we have identified the issue.
Posted Feb 03, 2024 - 03:51 PST
This incident affected: Web App (rollbar.com), API Tier (api.rollbar.com) and Processing pipeline (Core Processing Pipeline, iOS Symbolication pipeline, Source map symbolication pipeline).