Post Mortem - July 06, 2024

Incident:

Main Database server unavailable.

Affected services:

All web services hosted on Vultr that required a Database connection to function.

Incident start:

12:15 pm ART - 05:15 pm NZST

Incident end:

07:24 pm ART - 12:24 am NZST

Resolution steps:

  1. 15 minutes after the incident started, the team was notified by Vultr about the outage.
  2. Two hours into the outage, the team opened ticket BJP-53CGO with Vultr.
  3. Seven hours into the outage, the team observed the Database with status "RUNNING" and configured the Firewall and internal routing to get it working again.
  4. At 07:24 pm ART, connectivity to all sites was restored.
  5. 27 hours after the incident started, we received a reply from Vultr stating:
    "The host node on which your instance was previously located failed, necessitating a manual recovery of the data with the assistance of our onsite engineer. Following the recovery, the instance was migrated to a healthy node. Unfortunately, this process took longer than expected."
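The recovery in step 3 amounted to restoring network access once the instance was back in "RUNNING" state. A minimal sketch of what that check-and-restore could look like, assuming a Linux host running `ufw` and a PostgreSQL database on port 5432 (the firewall tooling, hostnames, port, and subnet are illustrative assumptions, not details from the incident):

```shell
# Check whether the migrated database instance is reachable (hypothetical host)
pg_isready -h db.internal.example -p 5432

# Re-allow the application subnet through the host firewall (hypothetical subnet)
ufw allow from 10.0.0.0/24 to any port 5432 proto tcp

# Confirm internal routing resolves toward the database host (hypothetical IP)
ip route get 10.0.0.10
```

In an incident like this, running the reachability check first avoids reopening firewall rules toward an instance that is still down.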

Mitigation steps:

  1. The team informed all clients about the issue.

Improvements and de-risking solutions:

  1. The team configured a second Database server within Vultr with replication from the main Database.
  2. The team defined and consolidated SOPs for switching Databases in the event of a future outage.
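As an illustration of the replica setup and switchover above, here is a minimal sketch assuming PostgreSQL streaming replication (the database engine, hostnames, user, and data directory are assumptions; the post-mortem does not name the database software):

```shell
# On the second server: seed a standby from the primary (hypothetical host/user).
# -R writes standby.signal and the primary connection settings, so the server
# starts as a streaming replica; -X stream copies WAL during the backup.
pg_basebackup -h primary.db.internal -U replicator \
    -D /var/lib/postgresql/data -R -X stream -P

# On the primary: verify the replica is attached and streaming
psql -c "SELECT client_addr, state FROM pg_stat_replication;"

# SOP step during a future outage: promote the replica to primary
# (hypothetical data directory)
pg_ctl promote -D /var/lib/postgresql/data
```

After promotion, the SOP would also need to repoint the web services' connection strings (or DNS) at the promoted server, since promotion alone does not redirect clients.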