DNS Resolution Issue
Incident Report for Rollbar
Postmortem

What Happened

We attempted to roll out DNSSEC as part of an effort to improve the security of our DNS records and protect our customers' data. The rollout succeeded in our testing environments, but production uses multiple DNS providers (both Amazon Route 53 and Google Cloud DNS) for redundancy, and that configuration caused issues. The result was an inadvertent DNS “split-brain” effect, with sporadic DNS failures until we rolled back DNSSEC. Unfortunately, due to network caching, some of our customers continued to experience issues long after the rollback.
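To illustrate how two independently signing providers can produce sporadic validation failures, here is a minimal sketch in Python. It is a simplified model and not a description of our actual configuration: the `Provider` class, the key ids, and the rule that a resolver may fetch keys from one provider and an answer from the other are all assumptions made for the example. This report does not establish which exact mechanism caused the split-brain.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Provider:
    name: str
    ksk: str              # hypothetical id of the key that signs this provider's DNSKEY set
    zsk: str              # hypothetical id of the key that signs this provider's answers
    published: frozenset  # key ids served in this provider's DNSKEY RRset


def validates(ds_at_parent, keys_from, answer_from):
    """True if an answer signed by `answer_from` validates against the
    DNSKEY set a resolver fetched (and cached) from `keys_from`."""
    # The DNSKEY set must chain to a DS record held at the registrar.
    chain_ok = keys_from.ksk in ds_at_parent and keys_from.ksk in keys_from.published
    # The key that signed the answer must be present in that DNSKEY set.
    answer_ok = answer_from.zsk in keys_from.published
    return chain_ok and answer_ok


def failing_paths(ds_at_parent, providers):
    """List every (keys provider, answer provider) pair that fails validation."""
    return [
        (k.name, a.name)
        for k, a in product(providers, repeat=2)
        if not validates(ds_at_parent, k, a)
    ]


# Each provider publishes only its own keys: mixed paths fail.
isolated = [
    Provider("route53", "ksk-a", "zsk-a", frozenset({"ksk-a", "zsk-a"})),
    Provider("clouddns", "ksk-b", "zsk-b", frozenset({"ksk-b", "zsk-b"})),
]
print(failing_paths({"ksk-a", "ksk-b"}, isolated))
```

In this model, queries that happen to stay on a single provider validate, while queries that mix providers fail, which would look sporadic from the outside. If each provider also published the other's zone-signing key, `failing_paths` would return an empty list.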

Next Steps

  • (COMPLETE) Ensure our testing environment DNS configuration completely mirrors production (including redundancy). This allows us to properly verify our DNS changes before rolling out to production in the future.
  • Add more alerting for DNS issues to improve visibility and allow us to be more proactive with any such issues in the future.
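
As a rough idea of what such alerting could look like, the sketch below sends a plain DNS query over UDP to a resolver and reads the response code from the reply header, then flags a batch of probes when the share of SERVFAIL responses crosses a threshold. It uses only the Python standard library. The function names, the fixed query id, the 20% threshold, and the choice of resolver are illustrative assumptions, not a description of our monitoring.

```python
import socket
import struct

# Subset of DNS response codes (RFC 1035) relevant to this kind of check.
RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def parse_rcode(response):
    """Return the response code name from a raw DNS reply."""
    if len(response) < 12:
        raise ValueError("truncated DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F  # low four bits of the flags field
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


def probe(name, resolver_ip, timeout=2.0):
    """Query one resolver over UDP and return the response code name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver_ip, 53))
        data, _ = sock.recvfrom(512)
    return parse_rcode(data)


def should_alert(rcodes, threshold=0.2):
    """Alert when the share of SERVFAIL results reaches the threshold."""
    failures = sum(1 for r in rcodes if r == "SERVFAIL")
    return bool(rcodes) and failures / len(rcodes) >= threshold
```

Probing several validating public resolvers on a schedule, rather than a single one, would be more likely to catch failures that only affect some resolution paths.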

Impact

  • DNS queries from clients were negatively impacted; customers reported sporadic resolution failures for our DNS records.

Resolution

  • DNSSEC was disabled, and the change was allowed to propagate as cached records expired.
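
The lingering impact follows from simple TTL arithmetic: a resolver that cached a DS record just before it was removed can keep using it until the record's TTL runs out. The sketch below shows the calculation in Python. The function name and the 86400-second TTL are assumptions for illustration; this report does not state the actual TTL of the records involved.

```python
from datetime import datetime, timedelta


def latest_cache_expiry(removed_at, ttl_seconds):
    """Latest time a resolver could still hold a record that was
    cached just before `removed_at`."""
    return removed_at + timedelta(seconds=ttl_seconds)


# Hypothetical TTL of one day, applied to the final DS removal in the timeline.
removed = datetime(2021, 3, 25, 12, 0)
print(latest_cache_expiry(removed, 86400))
```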

Timeline

All times are in Pacific Time (PDT)

  • Tuesday, March 23:

    • 2:50 PM — 3:00 PM: DNSSEC was enabled on both AWS Route 53 and GCP Cloud DNS by adding the DS records to our DNS registrar.
    • 7:21 PM: A customer reported experiencing DNS issues, specifically SERVFAIL errors.
    • 8:00 PM: Both the DS record for AWS Route 53’s DNSKEY and the AWS Route 53 name servers were removed from our DNS registrar.
    • 9:19 PM: Another customer followed up stating that they were still seeing DNS SERVFAIL errors.
  • Thursday, March 25:

    • 12:00 PM: DNSSEC was fully disabled; all DS records were removed from our DNS registrar.
Posted Apr 02, 2021 - 14:16 PDT

Resolved
This incident has been resolved.
Posted Mar 25, 2021 - 14:04 PDT
Monitoring
A small subset of customers is being affected by sporadic DNS resolution issues. We have identified the issue and have a fix rolling out now.
Posted Mar 25, 2021 - 12:02 PDT
This incident affected: Web App (rollbar.com).