On November 20th, a service interruption occurred due to a configuration change during infrastructure maintenance. Below, we outline what happened, the impact, and the steps we are taking to ensure this does not happen again.
Contributing Factors
* Peering setup was incomplete because infrastructure management scripts had not been executed.
* Internal DNS resolution relied on the newly elected primary instance, which lacked proper routing because of the incomplete peering setup.
* No alerts were triggered for critical conditions such as zero traffic or excessive pod restarts.
* HTTP 500 errors were not flagged for immediate action. 1.61 million HTTP 500 errors were returned over the course of the day, with the first spike at 3:37 AM. Had sufficient monitoring for HTTP 500 spikes been in place, we would have been alerted earlier and could have resolved the issue faster (a sketch of such a check follows this list).
* Documentation from the database vendor lacked important details for setting up peering on existing clusters.
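For illustration, the kind of check that was missing can be sketched against a Prometheus-compatible metrics API. The endpoint URL, the `http_requests_total` metric with its `code` label, and the thresholds below are assumptions made for the example, not our production configuration.

```python
# A minimal sketch of the missing alert checks, assuming a Prometheus-compatible
# metrics API and the conventional http_requests_total counter with a `code` label.
# The endpoint and thresholds are illustrative only.
import requests

PROMETHEUS_QUERY_URL = "http://prometheus.internal:9090/api/v1/query"  # hypothetical


def instant_value(expr: str) -> float:
    """Run an instant PromQL query and return the first sample's value (0.0 if no data)."""
    resp = requests.get(PROMETHEUS_QUERY_URL, params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def check_api_health() -> list[str]:
    alerts = []

    # Per-second rate of HTTP 500 responses over the last 5 minutes.
    error_rate = instant_value('sum(rate(http_requests_total{code="500"}[5m]))')
    if error_rate > 1.0:  # illustrative threshold
        alerts.append(f"HTTP 500 spike: {error_rate:.1f} errors/sec")

    # Overall request rate; near-zero traffic is itself a critical condition.
    total_rate = instant_value("sum(rate(http_requests_total[5m]))")
    if total_rate < 0.1:  # illustrative threshold
        alerts.append(f"Zero-traffic condition: {total_rate:.2f} requests/sec")

    return alerts


if __name__ == "__main__":
    for alert in check_api_health():
        print("ALERT:", alert)
```

In practice these conditions would live as alert rules in the monitoring system itself rather than in an ad hoc script; the point is that both an HTTP 500 spike and a sudden drop to zero traffic should page someone immediately.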
Next Steps
* Improve alerting coverage for HTTP server errors and zero-traffic conditions.
* Evaluate monitoring for pod restarts above a set threshold (a sketch of one such check follows this list).
* Explore fault-tolerant architectures for database and API dependencies.
* Evaluate the feasibility of fail-open mechanisms for phone support.
* Investigate and correct the Station Boards bug that caused boards to sign themselves out in response to this interruption.
* Address the bug in the Station Control Unit software so that automation rules continue to run during network instability.
* Include StatusPage details in customer onboarding emails and on the company website.
* Set up outbound phone notifications for managers during critical incidents that the team can initiate from outside Bryx infrastructure.
* Conduct exercises to improve response to infrastructure failures, including spinning up environments from scratch.
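As a concrete illustration of the restart-threshold item above, the sketch below uses the official Kubernetes Python client to list containers whose restart count exceeds a threshold. The threshold and the per-container reporting are assumptions for the example; the actual tooling is still under evaluation.

```python
# A minimal sketch of a pod-restart check, using the official Kubernetes Python
# client (the `kubernetes` package). The restart threshold is an illustrative
# assumption, not a committed value.
from kubernetes import client, config

RESTART_THRESHOLD = 5  # illustrative threshold


def containers_restarting_excessively() -> list[str]:
    """Return "namespace/pod/container restarts=N" entries above the threshold."""
    config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
    v1 = client.CoreV1Api()

    offenders = []
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        for cs in pod.status.container_statuses or []:
            if cs.restart_count > RESTART_THRESHOLD:
                offenders.append(
                    f"{pod.metadata.namespace}/{pod.metadata.name}/{cs.name}"
                    f" restarts={cs.restart_count}"
                )
    return offenders


if __name__ == "__main__":
    for entry in containers_restarting_excessively():
        print("ALERT: excessive restarts:", entry)
```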
We sincerely apologize for the disruption this caused and are committed to learning from this incident. Your trust is critical to us, and we are taking every step necessary to strengthen our platform’s reliability and our response processes.