How many blades does this ceiling fan have? A real-world thought experiment in observability.
I was sitting in a hair salon the other day and I found a rare moment of downtime. If you’ve ever dyed your hair and are lucky enough to have dark hair like yours truly, you know it takes a while. While I was sitting there, I resisted the urge to look at my phone and looked up instead. I noticed a ceiling fan spinning rapidly overhead. I decided to try counting the blades while it was spinning. The first thing I thought was to try to identify a specific blade as a marker to know when a rotation had completed but the blades all looked pretty similar. After a few minutes, I couldn’t figure out a clear way of getting a count from where I was sitting. That’s when I started wondering if this is exactly the kind of situation where I would use observability to answer the question. Here is a running system and a novel question I have about the system: “how many blades does this fan have?”
NOTE: This is purely a thought experiment; I don’t actually have a dashboard, monitoring, or alerting in place for a ceiling fan. My thought experiment also assumes that I operate this fan but did not install it, and that I have no documentation for it, because someone forgot to write it.
So let’s start with the easiest place, monitoring. Has anything alerted me? Not yet! In fact, I don’t think anyone would configure an alert to ensure a consistent number of blades on a fan. The alerts defined here would only cover states that trigger some form of action. I suppose if a new blade somehow grew out of the fan, that would be weird enough that it should alert someone, but the likelihood of that happening is pretty low. An alert like that would have the exact data I need; too bad no one predicted I would have this question. The alerts that would be defined are things like:
- the fan stopped spinning.
- the rotation of the fan has increased or decreased by a certain percentage. This may cover the previous alert, but who doesn’t like multiple alerts for the same thing?
- the room goes above a certain temperature. This may not be a direct result of fan operation, but it’s worth investigating.
- the power goes out for the building.
None of these alerts is particularly useful for answering my question. Although if the last alert fires, it would be handy: the fan would stop, and I could hurry up and answer my question while the power people get their outage resolved.
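To make that concrete, here is a rough sketch of what those alert rules might look like in code. Everything in it is made up for illustration: the `FanReading` fields, the `firing_alerts` function, and the thresholds are my assumptions, not a real monitoring setup. The thing to notice is that nothing in it ever looks at a blade count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FanReading:
    """Hypothetical snapshot of the signals the alerts care about."""
    rpm: float                 # current rotations per minute
    baseline_rpm: float        # what "normal" looks like for this fan
    room_temp_c: float         # room temperature in Celsius
    building_has_power: bool


def firing_alerts(reading: FanReading,
                  rpm_change_pct: float = 20.0,   # assumed threshold
                  max_temp_c: float = 28.0) -> List[str]:  # assumed threshold
    """Return the names of the alerts that would fire for this reading."""
    alerts = []
    if reading.rpm == 0:
        alerts.append("fan_stopped")
    if reading.baseline_rpm > 0:
        change = abs(reading.rpm - reading.baseline_rpm) / reading.baseline_rpm * 100
        if change >= rpm_change_pct:
            alerts.append("rpm_changed")
    if reading.room_temp_c > max_temp_c:
        alerts.append("room_too_hot")
    if not reading.building_has_power:
        alerts.append("power_outage")
    return alerts


healthy = FanReading(rpm=120, baseline_rpm=120, room_temp_c=22, building_has_power=True)
stopped = FanReading(rpm=0, baseline_rpm=120, room_temp_c=22, building_has_power=True)
```

With these made-up readings, the healthy fan fires nothing, and the stopped fan fires both the "stopped" and the "rotation changed" alerts, which is the duplicate-alert situation from the list.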
Ok, so my monitoring doesn’t have the answer. Let’s look at my metrics dashboards. I have a graph showing me the temperature for the room in which the fan operates, which is useful for alerting. Another graph tells me this fan has been up and running for 1287 days, which is cool, cool because you know, it is a fan. Ok, moving on from the bad jokes, aha! Here’s a piece of information that’s useful: I have a graph that shows me the number of rotations the fan completes per minute. This data was originally intended to feed the alert on the speed of the fan, but it comes in handy here.
I can now go back to my spinning fan and set up an experiment in production. I will count the number of blades I see go by a specific point for an entire minute. I then take that number and divide it by the rotations per minute and there it is, 4.6 blades! Since a fan can't have a fraction of a blade, that tells me my counting is a bit off. Run the experiment a couple more times and I can now get to a consistent-ish number around 5 blades.
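The arithmetic is simple enough to sketch in a few lines. The trial numbers below are invented for illustration (I never recorded real counts), and `estimate_blades` is just a name I picked. The idea is only: blade passes counted in a minute, divided by rotations per minute, averaged over a few noisy trials.

```python
from statistics import mean


def estimate_blades(blade_passes: int, rotations_per_minute: float,
                    minutes: float = 1.0) -> float:
    """Blades = passes counted at a fixed point / rotations in that window."""
    if rotations_per_minute <= 0 or minutes <= 0:
        raise ValueError("rotations_per_minute and minutes must be positive")
    return blade_passes / (rotations_per_minute * minutes)


# Hypothetical trials: (blade passes I counted in one minute, RPM from the graph).
# Counting by eye is noisy, so each trial gives a slightly different answer.
trials = [(552, 120.0), (610, 120.0), (598, 120.0)]

estimates = [estimate_blades(passes, rpm) for passes, rpm in trials]

# A fan has a whole number of blades, so round the average of the trials.
blade_count = round(mean(estimates))

print(estimates)
print(blade_count)
```

With these made-up counts, the first trial comes out to 4.6 and the average across the three rounds to 5.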
Reflecting on the experiment
Of course, if knowing the number of blades was absolutely critical, I could kill the power to the fan, but this is a production fan. If I stop it, everyone in the room will sweat a little too much both because of the room being warmer and because of alerts going off. Alternatively, I can just take a snapshot of the running system and get all the datapoints I need to answer the question.
As I said before, this was really just a thought experiment in my continued efforts to dig into observability. I was interested in finding a way to illustrate what it is for folks that may not be intimately familiar with monitoring, metrics, logging, tracing and so forth. In a thought experiment, it’s easy to imagine that all the data is available in a single system. In reality, any kind of barrier caused by the need to navigate multiple systems only increases the time to answer the question. This in turn trains people out of the habit of frequently asking questions about systems, because the cost is too high. I also learned how valuable a snapshot of the system really is, something about a picture being worth many words.