Approximately one month ago, we added tracing to our application so we could follow a message from ingest, to processing, to sending alerts to phones and station alerting systems. Messages that are passed from one micro-service to another are supposed to have a unique identifier in the header so we can trace messages across process boundaries. One of our microservices was not correctly adding the instrumentation, so when the message was received by the core, it would try to look up that instrumentation and fail. This caused the thread that was handling incoming messages to crash. Each core processor has 20 of these threads, and each cluster has 3 cores running. Thus, it wasn't until we received 60 bad messages from the microservices that we saw an issue with processing messages.
On November 13, we improved our monitoring to better alert us for these conditions, and the monitor triggered at 0100 ET on November 14. We restarted the service, and things were running smoothly again. After an extensive investigation on November 14, we determined the root cause and put in checks to handle missing instrumentation.