Details of events (in PDT time zone):
- 12:00 Traffic has continued increasing on the hosted service at a rapid pace. To keep up with increased load, we created a 4th read replica for our main Postgres database.
- 12:45 The new read replica tried to attach to the primary and the primary crashed as a result.
- 12:47 The primary came back up, but lost all previous replicas. The only way to address this in Google Cloud is to delete the replicas and create new ones. This is a slow process and there's nothing that can be done to speed it up.
- 14:31 After the additional replicas were created, the primary crashed again and lost all the replicas. The primary did not start on its own and needed manual intervention to get back in a working state.
- 15:00 We started creating new replicas and this time decided to turn off queries to the hosted service while that was happening to prevent the replicas or primary from falling over again.
- 16:00 The first replica that was created failed and needed to be recreated.
- 18:00 All replicas became functional again and querying was turned back on. We continue to monitor the service for any further issues.
Posted Oct 24, 2020 - 01:00 UTC
The issue has been identified and a fix is being implemented.