Rollbar experienced a major outage caused by human error.
While validating the health of the primary database, an engineer accidentally triggered the installation of the earlyoom
package. Because memory utilization on the primary database server was high, this caused processes to be killed until the erroneous package was removed.
The Web UI was unavailable to all users, and data was not processed, for approximately 30 minutes.
Manual failover procedures were initiated to switch to the existing replica and services were restarted.
earlyoom
mysqld restarted on the primary database instance and entered crash recovery; mysqld was then halted on the primary
instance to prevent split-brain issues.