Temporary Outage of Ingestion Servers Causing a Lack of Alerting

Incident Report for Bryx

Postmortem

Approximately one month ago, we added tracing to our application so that we could follow a message from ingestion, through processing, to the sending of alerts to phones and station alerting systems. Messages passed from one microservice to another are supposed to carry a unique identifier in the header so that we can trace them across process boundaries. One of our microservices was not adding this instrumentation correctly, so when the core received one of its messages, the lookup of that instrumentation failed and crashed the thread that was handling incoming messages. Each core processor runs 20 of these threads, and each cluster runs 3 cores, for a total of 60 threads. As a result, we did not see a problem with message processing until we had received 60 bad messages from the microservice, one for each thread.
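The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not our production code: the header name `trace-id`, the message shape, and the function names are hypothetical, and all 60 threads are modeled as a single pool reading from one queue. It shows how each message without the trace header removes exactly one worker thread, so processing only stops once the whole pool is gone.

```python
import queue
import threading
import time

THREADS_PER_CORE = 20    # from the report
CORES_PER_CLUSTER = 3    # from the report
TOTAL_WORKERS = THREADS_PER_CORE * CORES_PER_CLUSTER  # 60
TRACE_HEADER = "trace-id"  # hypothetical header name

# Silence the default traceback printed for each crashed thread (demo only).
threading.excepthook = lambda args: None


def worker(inbox: queue.Queue) -> None:
    """Process messages until told to stop; dies on a missing trace header."""
    while True:
        message = inbox.get()
        if message is None:  # shutdown sentinel
            return
        # Unchecked lookup: raises KeyError when the header is absent,
        # and the unhandled exception ends this thread for good.
        _trace_id = message["headers"][TRACE_HEADER]


def surviving_workers(total_workers: int, bad_messages: int) -> int:
    """Return how many workers are still alive after the bad messages arrive."""
    inbox: queue.Queue = queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(inbox,), daemon=True)
        for _ in range(total_workers)
    ]
    for w in workers:
        w.start()
    for _ in range(bad_messages):
        inbox.put({"headers": {}})  # message with no trace header

    # Each bad message kills one worker; wait until those threads have exited.
    expected_dead = min(bad_messages, total_workers)
    deadline = time.monotonic() + 5.0
    while (
        sum(not w.is_alive() for w in workers) < expected_dead
        and time.monotonic() < deadline
    ):
        time.sleep(0.01)

    alive = sum(w.is_alive() for w in workers)
    for _ in range(alive):  # let the remaining workers shut down cleanly
        inbox.put(None)
    return alive
```

With these assumptions, `surviving_workers(TOTAL_WORKERS, 59)` leaves one thread still processing, while the 60th bad message leaves none.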

On November 13, we improved our monitoring to better alert us to these conditions, and the monitor triggered at 01:00 ET on November 14. We restarted the service, and processing resumed normally. After an extensive investigation on November 14, we determined the root cause and added checks to handle missing instrumentation.
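A check for missing instrumentation might look something like the following. This is a hedged sketch rather than the fix we actually deployed: the header name `trace-id`, the `untraced-` fallback prefix, and the functions `extract_trace_id` and `handle_safely` are all hypothetical. The idea is to treat a missing trace identifier as a loggable condition instead of a fatal one, and to keep any single message from taking down the thread that handles it.

```python
import logging
import uuid
from typing import Callable

logger = logging.getLogger("core.ingest")  # hypothetical logger name
TRACE_HEADER = "trace-id"                  # hypothetical header name


def extract_trace_id(message: dict) -> str:
    """Return the message's trace ID, or a fallback ID if it is missing."""
    headers = message.get("headers") or {}
    trace_id = headers.get(TRACE_HEADER)
    if not trace_id:
        # Assumption: a generated placeholder is acceptable for tracing.
        trace_id = f"untraced-{uuid.uuid4().hex}"
        logger.warning(
            "message missing %s header; assigned %s", TRACE_HEADER, trace_id
        )
    return trace_id


def handle_safely(message: dict, process: Callable[[dict, str], None]) -> bool:
    """Process one message; log failures instead of crashing the thread."""
    try:
        process(message, extract_trace_id(message))
        return True
    except Exception:
        logger.exception("failed to process message; worker continues")
        return False
```

A worker loop that calls `handle_safely` for each message keeps running after a bad message, and the warning log gives monitoring something concrete to alert on.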

Posted Nov 14, 2023 - 15:54 UTC

Resolved

This incident has been resolved.

Posted Nov 01, 2023 - 23:22 UTC

Update

We have not yet determined the root cause, but we have added monitors to detect this issue and additional logging to help identify the true root cause.

Posted Nov 01, 2023 - 23:00 UTC

Update

We are still investigating the root cause of this issue.

To prevent this from recurring, we have added monitoring alerts that will immediately page our on-call Engineering team if any ingestion issues occur.

We will provide a full post-mortem of this issue as soon as possible.

Posted Nov 01, 2023 - 20:41 UTC

Monitoring

A temporary outage of our ingestion servers from 11:40 AM EDT to 12:29 PM EDT caused issues with Bryx alerting properly for jobs.

Our team has isolated the issue and deployed a fix, and all systems are operational at this time.

We are performing a thorough investigation and will provide a full root-cause analysis as soon as possible.

Posted Nov 01, 2023 - 16:52 UTC

This incident affected: Parser.