Achieving observability in production, starting with a finite state machine
This week I spent some time trying to answer questions in our production systems. I planned on doing so by leveraging tracing and Honeycomb, which I’ve been evaluating on my own time. The first question I set out to answer was around a problem that has been a pain in the butt of our platform’s operators.
One of the tools we use in production today manages the fleet of nodes, we call it admiral (get it? it watches the fleet). The job of this tool is pretty straight forward:
- receive requests to replace nodes
- launch replacement nodes and ensure they’re healthy
- migrate workloads off the old nodes
- remove the old nodes
The tool implements a finite state machine to handle the different scenarios that can occur when hosts are being replaced.The recurring problem with admiral is that it seems every so often it gets stuck or takes a vacation, but we haven’t had the time to investigate it further.
This problem increases the toil our operators have to deal with as once in a while, they have to remember to kick admiral. It also impacts our ability to manage and upgrade the fleet. Last but not least, it’s a constant source of mental anxiety for many team members.
Lucky for me, the Honeycomb blog had an article on the topic of instrumenting finite state machines by Bryan Fink. Our existing production application isn’t super easy to test, i know, another red flag. It also lacks a staging environment that could reproduce the issue we were seeing. So I started out by instrumenting a simple application that was implementing a finite state machine to see how quickly I could get something off the ground. I forked looplab/fsm and used the code in
alternate.go as a starting point.
I instrumented the code using a beeline and some good old fashion copy pasta, et voilà, a traceable finite state machine within minutes.
The code is simple, not pretty but simple, just a few more lines and it all works.
Unfortunately for my use-case, it wasn’t quite as easy to instrument the code in production. Each of the different state callbacks could take some time to complete, think minutes not seconds here. This means the application could be stopped and started multiple times during the time it takes for a node to be completely replaced. To address this in admiral, we implemented persistence of the state machine using Consul. The state gets stored at frequent intervals as well as when the application receives the signal to shutdown. In order for tracing to be useful here, I needed to persist the tracing information across application restarts. Otherwise I would end up with disconnected traces, likely when I needed them the most. I added some code to store the tracing information using propagation to provide continuity in my traces. This required adding some signal handling and a call to put the data provided by
span.SerializeHeaders using Consul’s key value store.
You can find the complete stateful example in the stateful-tracing-fsm repo. Looking at the traces below, the code now supports tracing across application restarts.
Cool, I was now ready to instrument the admiral code. A few minutes to write the code, another few to build and deploy it and within minutes, we started seeing valuable information roll in. We were able to determine how long replacing nodes in each datacenters took. We were able to prove that admiral was actually doing the work we expected it to, which was a great starting point. A few days after the code had been instrumented, one of my team members said “it looks like admiral is being lazy again this morning”. Usually the answer here would have been, just reset its state and restart the process. This time, I dug in a little deeper to understand the question better with questions like “what makes you say that?” or “what are you seeing?” The reason for the statement was that a job was kicked off to roll hosts and it looked like nothing was happening, the job usually kicks off ten requests simultaneously.
The two of us dug into the data we’d been collecting. Looking through the tracing, it was easy to see that there had been exactly five requests that came in, rather than the expected ten. First piece of the puzzle was found and a misunderstanding was cleared. It turned out there were only five hosts that needed to be replaced that time. Of the five requests that were received, only one was being fulfilled. Digging deeper, we found that four of the five requests were in datacenters that currently had no capacity to increase the fleet size to migrate workloads. Second piece found and problem uncovered. In fact, those four datacenters had been stuck for multiple days without an alert going off. The state that was causing the problem was looping endlessly without either fixing the problem or alerting anyone. This was very clear to see once we had graphs to visualize it.
Getting visibility into the root cause of the problem in minutes felt like complete absolute magic. Instead of wasting an hour or two digging through dozens of different logs and dashboards, we found the data and got on with solving the root cause. It was interesting to instrument the code with clear intent: to be able to answer the question “is this thing working or not?” For some reason, that’s something that’s never occurred to me before. Although we haven’t fully resolved the problems, at least we’re no longer living in in the dark. Observability FTW.