API is marking itself unhealthy, causing SCUs to disconnect

Incident Report for Bryx

Postmortem

A missing index on a collection used for SCU logging was causing queries to take too long. Incoming health check messages failed because the database took too long to respond, and the API was marked offline as a result.
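
The failure mode described above can be illustrated with a minimal sketch. It assumes the health check runs a database probe under a deadline and reports unhealthy when the probe does not finish in time. The names `check_health`, `fast_query`, `slow_query` and the 0.2 second timeout are hypothetical and are not taken from the actual API; the sleeps stand in for an indexed and an unindexed query.

```python
import concurrent.futures
import time

# Hypothetical deadline; the real health check timeout is not stated in this report.
HEALTH_CHECK_TIMEOUT_S = 0.2

# Shared worker pool so a slow probe does not block the caller past the deadline.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def check_health(db_probe, timeout_s=HEALTH_CHECK_TIMEOUT_S):
    """Return True if db_probe completes within timeout_s, otherwise False."""
    future = _executor.submit(db_probe)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        # The probe is still running: the database is too slow to answer in time.
        return False
    except Exception:
        # The probe itself failed (for example, a connection error).
        return False


def fast_query():
    """Stand-in for a query served by an index."""
    time.sleep(0.01)


def slow_query():
    """Stand-in for a query that scans a collection with no index."""
    time.sleep(0.5)


if __name__ == "__main__":
    print("indexed query healthy:", check_health(fast_query))
    print("unindexed query healthy:", check_health(slow_query))
```

In this sketch a slow probe is reported as unhealthy even though the database is still reachable, which is the same pattern that caused the API to be marked offline here.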

We have added the missing index to the collection and are working on better monitoring to detect this failure mode.

Posted Aug 30, 2023 - 12:39 UTC

Resolved

SCUs are going offline and back online because the API is marked unhealthy.

Posted Aug 30, 2023 - 11:00 UTC