One of the challenges in software development is to remain focused on delivering value to users. Interruptions and outages cause a lot of noise and distract engineers from planned work. Sometimes a bug escapes into production. At other times, systems are on fire because of an unexpected surge in traffic (at the risk of dating myself, this used to be known as the slashdot effect). Responding to these events comes with a sense of urgency and tends to raise the stress level of any engineering team. In the chaos, there is usually a lot of information floating around, many theories and much speculation.
Some time last year, our platform was troubled by a recurring problem involving our database. Each time this problem occurred, the following would happen:
- API errors triggered our monitoring system and alerted us
- users reached out due to failing or slow API requests
- graphs showed CPU spiking to nearly 100% utilization
- the team scrambled to reduce the load on the database
- the spike subsided
- alerts cleared up, and API responsiveness returned to normal
- the team continued to investigate
This caused all sorts of grief for the team. Many hours of sleep were lost, and days were sunk into trying to get to the bottom of the issue. Reproducing a database issue that occurs intermittently, days or weeks apart, is always tricky. Add to that the complexity of load generated by hundreds of nodes distributed around the world, and the problem becomes nearly impossible to stage.
We started off doing the obvious thing: searching for answers on StackOverflow. We tweaked knobs on the database, each time thinking “Ah, of course, this is the one we’ve been missing!” Honestly, I had no idea there were so many critical configuration settings in MySQL. Each tweak forced us to play the waiting game to see if the spikes would recur. When they inevitably did, we upgraded the size of the VMs and waited again (when in doubt, throw money at the problem). We instrumented the database, and waited. We made some more changes after digging through the millions of queries surfaced by the instrumentation, and waited.
Ultimately, we were improving our use of the database. Things all around were getting better, but it seemed we were no closer to solving the particular problem we had originally set out to fix. With all the information we were gathering, we fell into the mode of “let’s try something and see if it solves it.” Too often, the problem went away for a period of time. This left us satisfied that the fire had been put out, only to find out later that it was still smouldering beneath the surface.
Baloney detection kit
A few months after the problems started, I read a fantastic book by Carl Sagan: “The Demon-Haunted World”. This book was written over twenty years ago, but its lessons still ring true today. It encourages readers to apply critical thinking when confronted with new information. In other words: don’t just take things at face value. It goes on to describe various methods that can be used to apply skeptical thinking to differentiate fact from fiction. The baloney detection kit is a set of tools one can use to formulate an argument and become better informed.
After weeks of shooting in the dark, I was pretty much ready to try anything to get over our database woes. Taking a step back and applying a more systematic approach to our troubleshooting sounded like a good thing to do. This problem seemed like the perfect scenario in which to try out the tools offered by the baloney detection kit.
1. Wherever possible there must be independent confirmation of the “facts”
As I said earlier, when dealing with an outage or a degradation, there is a lot of information coming from all over the place. Users of the system are a gold mine of independent sources to confirm facts.
TANGENT — I want to clarify something here: if someone is experiencing an issue, they’re experiencing it. They may be explaining it in different terms, and their mental model will likely differ from your own, but they’re experiencing an issue. Starting from a place of empathy and curiosity greatly helps build a better understanding of the problem. If at all possible, sit next to the person while they’re experiencing it; it’s absolutely eye-opening. — END TANGENT
Although users are a great independent source of information, it’s important to equip them with the means to provide helpful input. Building a simple debugging tool, or a script they can run from their machines to execute commands and gather empirical data, is an easy way to accomplish this.
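As a rough illustration, a user-runnable probe can be as small as a script that times a few calls and prints structured numbers back. This is a hypothetical sketch, not the tool we actually built; the function names and the example endpoint in the comment are illustrative.

```python
# Hypothetical sketch of a client-side probe a user could run and paste the
# output from. Names and the example endpoint are illustrative assumptions.
import json
import statistics
import time

def probe(fn, samples=5):
    """Time repeated calls to fn and return latency stats in milliseconds."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return {
        "samples": samples,
        "median_ms": round(statistics.median(latencies), 2),
        "max_ms": round(max(latencies), 2),
    }

if __name__ == "__main__":
    # A real probe would call the failing API, e.g. with
    # urllib.request.urlopen("https://api.example.com/health").
    # Here a short sleep stands in for the request.
    print(json.dumps(probe(lambda: time.sleep(0.01))))
```

The JSON output gives you empirical numbers from the user’s vantage point instead of “it feels slow,” which is exactly the kind of independent confirmation this rule asks for.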
We created a simple shared doc to capture all the different facts about an outage and walked through them as a team. Braindumping everything we knew about the outage, then digging into logs, graphs, flow diagrams and source code to determine whether these facts were actually true, helped reduce misinformation. Some of the best questions we asked at this stage of our investigation:
- What’s the evidence for this?
- Can you help me understand why X is true?
Confirming the facts lets us quickly debunk unsubstantiated claims.
2. Encourage substantive debate on the evidence by knowledgeable proponents of all points of view
If at all possible, bring all the experts into the room to discuss the different aspects of the available hypotheses. We did, and it gave us great indications of where to dig. It also gave team members the opportunity to elevate their knowledge by learning from the experience of other teams.
3. Arguments from authority carry little weight — authorities have made mistakes in the past. They will do so again in the future. Perhaps a better way to say it is that in science there are no authorities; at most, there are experts.
When a team is heads-down trying to solve a major problem, nothing derails it quite like someone from high up making an uninformed decision to “solve” the problem. Rarely does this actually solve anything. It undermines the confidence of the engineers and sometimes even leads to a bigger outage.
We did not have to deal with this scenario, but we did our due diligence to communicate our progress every step of the way. Oftentimes, this is exactly what people in positions of leadership are looking for. It’s when people go silent for extended periods that leadership gets nervous.
4. Spin more than one hypothesis
Sometimes the answer to a problem seems obvious at first. But spending some time brainstorming to generate multiple hypotheses engages the wider team. It also provides an opportunity to identify other components that could be causing problems. In the case of the database issues we were seeing, we spent time identifying dependencies in the system and thinking of ways to break them off. This allowed us to put together a list of several improvements that could be made to the system in parallel, independently of each other.
5. Try not to get overly attached to a hypothesis just because it’s yours
This seems like an easy one to follow, but it isn’t always, depending on the environment fostered within an organization. In highly competitive organizations, I’ve seen individuals go as far as stealing other people’s ideas and passing them off as their own to get ahead. Thankfully, our environment is focused on the success of the team in its entirety, which makes openly sharing ideas safer. The baloney detection kit suggests not getting too attached to a hypothesis, as this can lead to missing obvious flaws in it. I’ve been there myself, and it’s true: if you can’t see the flaws yourself, someone else will.
6. If whatever it is you’re explaining has some measure, some numerical quantity attached to it, you’ll be much better able to discriminate among competing hypotheses
Being able to predict how a particular change will affect a system is an important part of testing a hypothesis. Without it, how would we know that the change we made had the impact we predicted? We created a Grafana dashboard in a few minutes to give us an idea of the current state of the system. We then took a stab at making predictions about the impact of each of our proposed changes. One change would reduce the number of connections by an estimated 75%, another would reduce the number of write operations by 90%, and so forth.
Having the data on hand when deciding where to spend the effort is critical, and it also forces folks who propose solutions to think through the entire scope of their changes. Also, I LOVE watching a graph reflect a change as it goes into production. There’s something therapeutic about it, I swear.
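Once a change ships, the prediction can be checked against the dashboard numbers. This is a minimal sketch of that comparison; the 75% figure comes from the estimates above, but the before/after readings and the tolerance are made-up assumptions for illustration.

```python
# Hypothetical sketch: compare a predicted impact against what the dashboard
# actually shows after a change ships. Sample readings are invented.
def observed_reduction(before, after):
    """Percent reduction between two metric readings."""
    return round(100 * (before - after) / before, 1)

def verdict(predicted_pct, before, after, tolerance=10.0):
    """Did the observed reduction land within tolerance of the prediction?"""
    observed = observed_reduction(before, after)
    return "confirmed" if abs(observed - predicted_pct) <= tolerance else "missed"

# Connection-pooling change: predicted -75%; suppose the dashboard showed
# connections dropping from 400 to 95 after rollout.
print(verdict(75, 400, 95))
```

The point isn’t the arithmetic; it’s that writing the prediction down before the change makes the graph afterwards a test, not a post-hoc story.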
7. If there’s a chain of argument, every link in the chain must work (including the premise) — not just most of them
Another rule that seems obvious, but is often overlooked. In fact, it’s overlooked so often that there’s even a meme for it. When brainstorming hypotheses, it’s fine to be fuzzy on some of the details. However, when comparing two competing hypotheses, it’s critical to have a clear understanding of every component in each one. Otherwise, it’s really easy to choose a hypothesis that leads the team down a path of unknowns.
NOTE: I’m not saying that it’s a bad choice to invest time in digging into unknowns. What I am saying is that the time to dig into those unknowns is before deciding which hypothesis to implement.
8. Occam’s Razor
Occam’s razor suggests that when faced with two hypotheses that explain the data equally well, you should choose the simpler one. Usually, the simpler theory is also easier to test. For example, if there is a load spike on a system, it could well be that someone has targeted the machine, broken into it and installed a malicious application. Or, it could be that an application is eating up all the RAM and causing the system to swap. Either theory could explain the root cause, but disproving the latter is usually much easier than the former.
In software engineering, this rule is an easy one to follow, since a simpler solution usually means we can ship sooner. It also has the added bonus of reducing the complexity of the system in the long run. In our case, we took all the competing hypotheses and ranked them by how easy each was to invalidate. This helped build confidence that we had the right tools in place to get correct information about our systems.
9. Pay attention to falsifiability
Can a theory be tested? If a change is made, how will we know it has been effective? A lack of falsifiability was one of the biggest pitfalls of the waiting-game approach early on in our troubleshooting. Build a dashboard with meaningful graphs. Set up monitors and alerts. Instrument the code to ensure the code path of the change is executed and behaving as expected. There’s a plethora of tools available to software engineers to give them visibility into the changes they make. Use them to prove or disprove a theory.
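Instrumenting a code path can be lightweight. Here’s a minimal, self-contained sketch using an in-process counter; the metric and function names are made up, and a real system would export these to a metrics backend rather than keep them in memory.

```python
# Minimal sketch of making a change falsifiable with a counter: if the new
# code path never increments its metric, the fix isn't actually running.
# Names are illustrative; real code would export to a metrics backend.
import collections
import functools

metrics = collections.Counter()

def counted(metric_name):
    """Decorator that counts every invocation of the wrapped function."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            metrics[metric_name] += 1
            return fn(*args, **kwargs)
        return inner
    return wrap

@counted("db.batched_write")
def batched_write(rows):
    # Stand-in for the optimized write path we want evidence of.
    return len(rows)

batched_write([1, 2, 3])
batched_write([4])
print(metrics["db.batched_write"])  # → 2
```

A flat-lined counter after deployment disproves the theory just as decisively as a healthy one supports it, which is the whole point of this rule.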
The database problem was eventually laid to rest, though it took significant investment to get there. Ultimately, the takeaway is that applying the scientific method to computer problems works. Ever since this experiment, I’ve dug deeper to get to the bottom of problems. In a talk I listened to a while ago, the presenter said that computers either work or they don’t. There are no maybes and no magic in it, and I couldn’t agree more.