Post Mortem - July 06, 2024

Incident:

Main Database server unavailable.

Affected services:

All web services hosted on Vultr that required a Database connection to function.

Incident start:

12:15 pm ART - 05:15 pm NZST

Incident end:

07:24 pm ART - 12:24 am NZST

Resolution steps:

  1. 15 minutes after the incident started, the team was notified by Vultr about the outage.
  2. Two hours into the outage, the team opened ticket BJP-53CGO with Vultr.
  3. Seven hours into the outage, the team observed the Database with status "RUNNING" and configured the Firewall and internal routing to get it working again.
  4. At 07:24 pm ART, connectivity to all sites was restored.
  5. 27 hours after the incident started, we received a reply from Vultr stating:
    "The host node on which your instance was previously located failed, necessitating a manual recovery of the data with the assistance of our onsite engineer. Following the recovery, the instance was migrated to a healthy node. Unfortunately, this process took longer than expected."
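The recovery in step 3 amounted to restoring network access once the instance was back in "RUNNING" state. A minimal sketch of what that check-and-restore could look like, assuming a Linux host running `ufw` and a PostgreSQL database on port 5432 (the firewall tooling, hostnames, port, and subnet are illustrative assumptions, not details from the incident):

```shell
# Check whether the migrated database instance is reachable (hypothetical host)
pg_isready -h db.internal.example -p 5432

# Re-allow the application subnet through the host firewall (hypothetical subnet)
ufw allow from 10.0.0.0/24 to any port 5432 proto tcp

# Confirm internal routing resolves toward the database host (hypothetical IP)
ip route get 10.0.0.10
```

In an incident like this, running the reachability check first avoids reopening firewall rules toward an instance that is still down.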

Mitigation steps:

  1. The team informed all clients about the issue.

Improvements and de-risking solutions:

  1. The team configured a second Database server within Vultr with replication from the main Database.
  2. The team defined and consolidated SOPs for switching Databases in the event of a future outage.
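As an illustration of the replica setup and switchover above, here is a minimal sketch assuming PostgreSQL streaming replication (the database engine, hostnames, user, and data directory are assumptions; the post-mortem does not name the database software):

```shell
# On the second server: seed a standby from the primary (hypothetical host/user).
# -R writes standby.signal and the primary connection settings, so the server
# starts as a streaming replica; -X stream copies WAL during the backup.
pg_basebackup -h primary.db.internal -U replicator \
    -D /var/lib/postgresql/data -R -X stream -P

# On the primary: verify the replica is attached and streaming
psql -c "SELECT client_addr, state FROM pg_stat_replication;"

# SOP step during a future outage: promote the replica to primary
# (hypothetical data directory)
pg_ctl promote -D /var/lib/postgresql/data
```

After promotion, the SOP would also need to repoint the web services' connection strings (or DNS) at the promoted server, since promotion alone does not redirect clients.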