Post Mortem - July 06, 2024
Incident:
Main Database server unavailable.
Affected services:
All web services hosted on Vultr that required a Database connection to function.
Incident start:
12:15 pm ART - 03:15 am NZST (July 07)
Incident end:
07:24 pm ART - 10:24 am NZST (July 07)
Resolution steps:
- 15 minutes after the incident started, the team was notified by Vultr about the outage.
- 2 hours into the outage, the team opened a support ticket with Vultr (ID BJP-53CGO).
- 7 hours into the outage, the team observed the Database server back in the "RUNNING" state and reconfigured the firewall and internal routing to restore connectivity (a minimal connectivity check is sketched after this list).
- At 19:24 ART, the connection to all sites was restored.
- 27 hours after the incident started, we received a reply from Vultr stating:
"The host node on which your instance was previously located failed, necessitating a manual recovery of the data with the assistance of our onsite engineer. Following the recovery, the instance was migrated to a healthy node. Unfortunately, this process took longer than expected."
Mitigation steps:
- The team informed all clients about the issue.
Improvements and de-risking solutions:
- The team configured a second Database server within Vultr with replication from the main Database.
- The team defined and consolidated SOPs for switching to the replica Database in case of a new outage (a minimal switchover sketch follows this list).
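
A minimal sketch of the switchover step in the SOP, assuming the web services read the Database host from a small config file; the hostnames, port, and file path are illustrative assumptions rather than details from this report.

```python
# Minimal sketch of the database switchover SOP. All names below
# (hosts, port, config path) are hypothetical assumptions.
import socket
from pathlib import Path

PRIMARY = "db-primary.internal"    # hypothetical main Database host
REPLICA = "db-replica.internal"    # hypothetical replicated Database host
DB_PORT = 5432                     # hypothetical port; depends on the engine
CONFIG = Path("/etc/app/db_host")  # hypothetical file the web services read


def is_up(host: str, port: int = DB_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the database host succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def switch_to_replica() -> None:
    """Point the application at the replica if the primary is unreachable."""
    if is_up(PRIMARY):
        print("Primary is healthy; no switchover needed")
        return
    if not is_up(REPLICA):
        raise RuntimeError("Neither primary nor replica is reachable")
    CONFIG.write_text(REPLICA + "\n")
    print(f"Switched application database host to {REPLICA}")


if __name__ == "__main__":
    switch_to_replica()
```

This only covers redirecting the web services to the replica; promoting the replica to accept writes depends on the database engine in use and is not covered here.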