Bryx Service Outage

Incident Report for Bryx

Postmortem

Summary

On November 20th, a service interruption occurred due to a configuration change during infrastructure maintenance. Below, we outline what happened, the impact, and the steps we are taking to ensure this does not happen again.

What Happened?

  • At approximately 4:00 PM EST on November 19th, a configuration change was applied to establish network peering with our database provider. Peering lets machines communicate over an internal network rather than over the public internet, which improves both the speed and the security of the system.
  • Around 3:45 AM EST on November 20th, the database cluster elected a new primary instance. Because of incomplete route configuration, the newly elected primary was not reachable over the peered network.
  • The routes were incomplete because a provisioning process essential for configuring network routing was never executed; a basic reachability check (see the sketch after this list) would have surfaced the gap before the failover.
  • This caused instability across the API and dependent services as health checks repeatedly failed.
  • The incident was discovered only after a support ticket was submitted, which prolonged the downtime.
  • The issue was resolved by 5:50 AM by removing the peering configuration and reverting to standard network routing.
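
For illustration only, here is a minimal sketch of the kind of reachability check that could have caught the incomplete routes after the peering change. The hostnames and port below are placeholders, not our actual cluster topology.

```python
import socket

# Placeholder hostnames and port for the database cluster nodes; the real
# cluster topology is not part of this report.
DB_NODES = [
    ("db-node-0.internal.example", 5432),
    ("db-node-1.internal.example", 5432),
    ("db-node-2.internal.example", 5432),
]

def check_reachability(nodes, timeout=3.0):
    """Attempt a TCP connection to every node and report the failures."""
    unreachable = []
    for host, port in nodes:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                print(f"OK          {host}:{port}")
        except OSError as exc:
            print(f"UNREACHABLE {host}:{port} ({exc})")
            unreachable.append((host, port))
    return unreachable

if __name__ == "__main__":
    failed = check_reachability(DB_NODES)
    # Any unreachable node suggests the peering routes are incomplete; in that
    # state a failover could elect a primary the application cannot reach.
    if failed:
        raise SystemExit(f"{len(failed)} node(s) unreachable over the peered network")
```

Run after any peering or routing change, a check like this turns a silent routing gap into an immediate, visible failure.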

Impact

  • Service outage notifications sent from the API failed to reach customers.
  • Station Boards were forcibly signed out and needed to be manually re-paired.
  • PIN-based customer support through the phone system was disrupted, as it relies on APIs affected by the outage.
  • Station Alerting firmware version 4.5.3 restarted itself repeatedly upon detecting network instability, which caused some missed alerts even with GPIO-based backups in place.

Resolution

  • After the delayed notification of the service interruption, our team reverted the peering changes and restored connectivity to the existing database clusters.

Root Causes

  1. Configuration Missteps:
     * Peering setup was incomplete because the infrastructure management scripts that configure routing were never executed.
     * Internal DNS resolution relied on the newly elected primary instance, which lacked proper routing.
  2. Monitoring Gaps:
     * No alerts were triggered for critical conditions such as zero traffic or excessive pod restarts.
     * HTTP 500 errors were not flagged for immediate action. 1.61 million HTTP 500 errors were returned over the day, with the first spike at 3:37 AM. Had sufficient monitoring for HTTP 500 spikes been in place, we would have been alerted sooner and could have resolved the issue faster (an illustrative check is sketched at the end of this section).
  3. Documentation Deficiencies:
     * Documentation from the database vendor lacked important details for setting up peering on existing clusters.
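
For illustration, a minimal sketch of the kind of per-window check that would have flagged both the HTTP 500 spike and a zero-traffic condition. The thresholds and request counts are placeholders, not our production values or tooling.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """Request counts for one monitoring window (e.g., five minutes)."""
    total_requests: int
    server_errors: int  # HTTP 5xx responses

# Illustrative thresholds only.
ERROR_RATE_THRESHOLD = 0.05   # alert if more than 5% of responses are 5xx
MIN_EXPECTED_TRAFFIC = 1      # alert if traffic drops to zero

def evaluate_window(stats: WindowStats) -> list[str]:
    """Return the alert conditions triggered by this window, if any."""
    alerts = []
    if stats.total_requests < MIN_EXPECTED_TRAFFIC:
        alerts.append("zero traffic: no requests reached the API this window")
    elif stats.server_errors / stats.total_requests > ERROR_RATE_THRESHOLD:
        alerts.append(
            f"HTTP 5xx spike: {stats.server_errors}/{stats.total_requests} "
            "responses were server errors"
        )
    return alerts

if __name__ == "__main__":
    # A window during the outage might have looked like this (made-up numbers):
    for alert in evaluate_window(WindowStats(total_requests=4200, server_errors=3900)):
        print("PAGE ON-CALL:", alert)
```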

Corrective Actions

  • Immediate improvements to monitoring policies are in progress to address alerting gaps.
  • StatusPage visibility is being enhanced for better customer communication.
  • To ensure resilience and prevent future incidents, we are implementing the following changes:
  1. Monitoring Enhancements:
     * Improve alerting coverage for HTTP server errors and zero-traffic conditions.
     * Evaluate monitoring that alerts when pod restarts exceed a set threshold (an illustrative check is sketched after this list).
  2. Resilience Improvements:
     * Explore fault-tolerant architectures for database and API dependencies.
     * Evaluate the feasibility of fail-open mechanisms for phone support.
     * Investigate and correct the Station Boards bug that caused boards to sign themselves out during this interruption.
     * Address the bug in the Station Control Unit software so that automation rules continue to run during network instability.
  3. Communication Improvements:
     * Include StatusPage details in customer onboarding emails and on the company website.
     * Set up outbound phone notifications for managers during critical incidents that the team can initiate from outside Bryx infrastructure.
  4. Team Training:
     * Conduct exercises to improve response to infrastructure failures, including spinning up environments from scratch.
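
As a sketch of the pod-restart check mentioned in item 1 above, the following uses the official Kubernetes Python client; the namespace and threshold are placeholders, and a production version would page on-call rather than print.

```python
from kubernetes import client, config

RESTART_THRESHOLD = 5   # illustrative; tune per workload
NAMESPACE = "default"   # placeholder namespace

def pods_restarting_excessively(namespace: str, threshold: int):
    """Return (pod, container, restart_count) tuples exceeding the threshold."""
    v1 = client.CoreV1Api()
    offenders = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for status in pod.status.container_statuses or []:
            if status.restart_count > threshold:
                offenders.append((pod.metadata.name, status.name, status.restart_count))
    return offenders

if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    for pod_name, container, restarts in pods_restarting_excessively(NAMESPACE, RESTART_THRESHOLD):
        # In production this would trigger an alert rather than a print statement.
        print(f"ALERT: {pod_name}/{container} restarted {restarts} times")
```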

We sincerely apologize for the disruption this caused and are committed to learning from this incident. Your trust is critical to us, and we are taking every step necessary to strengthen our platform’s reliability and our response processes.

Posted Nov 26, 2024 - 20:57 UTC

Resolved

This incident has been resolved.
Posted Nov 21, 2024 - 12:22 UTC

Monitoring

All services are functioning. We will continue to monitor this to confirm reliability.
Posted Nov 20, 2024 - 13:26 UTC

Update

We have deployed a fix and are currently monitoring for stability.
Posted Nov 20, 2024 - 11:01 UTC

Update

We are continuing to investigate this issue.
Posted Nov 20, 2024 - 11:00 UTC

Update

We are continuing to investigate this issue.
Posted Nov 20, 2024 - 10:34 UTC

Investigating

We are currently investigating a partial outage affecting all Bryx services.
Posted Nov 20, 2024 - 10:14 UTC
This incident affected: Websites, API, Geocoder, Parser, Ingest Email, and SNPP Service.