Achieving observability in production, starting with a finite state machine

Good timing for the shirt to arrive the same week I started instrumenting things in prod :D

This week I spent some time trying to answer questions about our production systems. I planned on doing so by leveraging tracing and Honeycomb, which I’ve been evaluating on my own time. The first question I set out to answer was about a problem that has been a pain in the butt for our platform’s operators.

The problem

We have a tool, admiral, whose job is to:

  • receive requests to replace nodes
  • launch replacement nodes and ensure they’re healthy
  • migrate workloads off the old nodes
  • remove the old nodes

The tool implements a finite state machine to handle the different scenarios that can occur while hosts are being replaced. The recurring problem with admiral is that every so often it seems to get stuck or take a vacation, but we haven’t had the time to investigate further.
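To make that concrete, here’s a minimal sketch of what such a state machine can look like. This isn’t admiral’s actual code: I’m assuming Go, and the state names are illustrative, loosely mirroring the steps listed above.

```go
package main

import "fmt"

// State represents one step in the node replacement flow.
// The names are illustrative; they loosely mirror the list above.
type State string

const (
	ReplaceRequested     State = "replace-requested"
	LaunchingReplacement State = "launching-replacement"
	MigratingWorkloads   State = "migrating-workloads"
	RemovingOldNode      State = "removing-old-node"
	Done                 State = "done"
)

// step runs the work for the current state and returns the next one.
// Real transitions would depend on the outcome of each action (node
// healthy? capacity available?); here they simply advance.
func step(s State) State {
	switch s {
	case ReplaceRequested:
		return LaunchingReplacement
	case LaunchingReplacement:
		return MigratingWorkloads
	case MigratingWorkloads:
		return RemovingOldNode
	default:
		return Done
	}
}

func main() {
	for s := ReplaceRequested; s != Done; {
		next := step(s)
		fmt.Printf("%s -> %s\n", s, next)
		s = next
	}
}
```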

When admiral gets stuck like this, it increases the toil our operators have to deal with: once in a while, they have to remember to kick it. It also impacts our ability to manage and upgrade the fleet. Last but not least, it’s a constant source of anxiety for many team members.

First steps

I instrumented the code using a beeline and some good old-fashioned copy pasta, et voilà: a traceable finite state machine within minutes.

The code is simple. Not pretty, but simple: just a few more lines and it all works.
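For flavour, here’s roughly what that instrumentation looks like. Again, this is a sketch rather than the real code: I’m assuming the Go beeline (github.com/honeycombio/beeline-go) and reusing the illustrative state names from the sketch above.

```go
package main

import (
	"context"

	"github.com/honeycombio/beeline-go"
)

// State mirrors the illustrative states from the earlier sketch.
type State string

const (
	ReplaceRequested State = "replace-requested"
	Done             State = "done"
)

// step stands in for the existing transition logic.
func step(ctx context.Context, s State) State {
	return Done
}

// runStep wraps a single transition in a span so every state change
// shows up in the trace, tagged with the state it started from and
// the state it moved to.
func runStep(ctx context.Context, s State) State {
	ctx, span := beeline.StartSpan(ctx, "fsm.step")
	defer span.Send()

	beeline.AddField(ctx, "fsm.state", string(s))
	next := step(ctx, s)
	beeline.AddField(ctx, "fsm.next_state", string(next))
	return next
}

func main() {
	beeline.Init(beeline.Config{
		WriteKey: "YOUR-API-KEY", // placeholder
		Dataset:  "admiral",      // placeholder dataset name
	})
	defer beeline.Close()

	// One trace per replacement request; each step is a child span.
	ctx, root := beeline.StartSpan(context.Background(), "replace-node")
	defer root.Send()

	for s := ReplaceRequested; s != Done; {
		s = runStep(ctx, s)
	}
}
```

Each state transition becomes a span, and the state names become queryable fields in Honeycomb.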

Simple, right?

You can find the complete stateful example in the stateful-tracing-fsm repo. As the traces below show, the code now supports tracing across application restarts.

trace without interruption
trace interrupted and resumed
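The trick that makes resumption possible is persisting the trace propagation context alongside the FSM’s own state and rehydrating it on startup. Here’s a minimal sketch under the same assumptions as above (Go beeline, illustrative names and storage); if I’m reading the beeline correctly, SerializeHeaders and trace.NewTrace are the relevant pieces, but the complete working version lives in the stateful-tracing-fsm repo.

```go
package main

import (
	"context"
	"os"

	"github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/trace"
)

// Illustrative: any durable store works; admiral would keep this
// wherever it already persists FSM state.
const headerFile = "trace-headers.txt"

// saveTraceState persists the serialized trace propagation header so a
// later process can attach its spans to the same trace.
func saveTraceState(ctx context.Context) error {
	span := trace.GetSpanFromContext(ctx)
	if span == nil {
		return nil
	}
	return os.WriteFile(headerFile, []byte(span.SerializeHeaders()), 0o644)
}

// resumeOrStartTrace restores the trace from a saved header if one
// exists; otherwise it starts a fresh trace.
func resumeOrStartTrace(ctx context.Context) (context.Context, *trace.Span) {
	headers := ""
	if b, err := os.ReadFile(headerFile); err == nil {
		headers = string(b)
	}
	ctx, t := trace.NewTrace(ctx, headers)
	return ctx, t.GetRootSpan()
}

func main() {
	beeline.Init(beeline.Config{WriteKey: "YOUR-API-KEY", Dataset: "admiral"})
	defer beeline.Close()

	ctx, span := resumeOrStartTrace(context.Background())
	defer span.Send()

	// ... run FSM steps as before, then persist the trace state so the
	// next run (after a restart) continues the same trace.
	if err := saveTraceState(ctx); err != nil {
		beeline.AddField(ctx, "error", err.Error())
	}
}
```

Because the serialized header carries the trace ID, spans created after a restart attach to the same trace, which is what the resumed trace above shows.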

Answers

The two of us dug into the data we’d been collecting. Looking through the traces, it was easy to see that exactly five requests had come in, rather than the expected ten. The first piece of the puzzle was found and a misunderstanding was cleared up: it turned out only five hosts needed to be replaced that time.

Of the five requests that were received, only one was being fulfilled. Digging deeper, we found that four of the five requests were in datacenters that currently had no capacity to increase the fleet size and migrate workloads. Second piece found and problem uncovered: those four datacenters had been stuck for multiple days without an alert going off. The state that was causing the problem was looping endlessly without either fixing the problem or alerting anyone. This was very clear to see once we had graphs to visualize it.

Outcome
