Queries outage

Incident Report for The Graph

Resolved

This incident has been resolved.

Posted Oct 24, 2020 - 01:15 UTC

Monitoring

Timeline of events (all times in PDT):
- 12:00 Traffic on the hosted service continued to increase rapidly. To keep up with the load, we created a fourth read replica for our main Postgres database.
- 12:45 The new read replica tried to attach to the primary, and the primary crashed as a result.
- 12:47 The primary came back up but had lost all of its existing replicas. The only way to address this in Google Cloud is to delete the replicas and create new ones. This is a slow process, and nothing can be done to speed it up.
- 14:31 After the new replicas were created, the primary crashed again and lost all of its replicas. This time the primary did not restart on its own and needed manual intervention to return to a working state.
- 15:00 We started creating new replicas again and decided to turn off queries to the hosted service while they were being created, to prevent the replicas or the primary from falling over again.
- 16:00 The first replica that was created failed and needed to be recreated.
- 18:00 All replicas became functional again and querying was turned back on. We continue to monitor the service for any further issues.

Posted Oct 24, 2020 - 01:00 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 23, 2020 - 22:15 UTC

Update

We are continuing to investigate this issue.

Posted Oct 23, 2020 - 21:51 UTC

Investigating

We are currently investigating this issue.

Posted Oct 23, 2020 - 21:51 UTC

This incident affected: Upgrade Indexer - Queries.