Observability-driven development
I got a little sidetracked with the first two deliverables of my application. I spent so much time learning what I needed to deploy my applications that I ran low on time to investigate observability, which was the point of this whole exercise. I had this plan to learn a plethora of new technologies as part of this last component of the project but decided to keep it simple and save that learning for later. This allowed me to refocus on my initial goal.
Picture, where is the picture?
I absolutely can’t stress enough the importance of a good diagram when it comes to understanding a system. People can spend hours explaining to me how something is put together, but until I see it for myself, I’ll likely just look like a deer in the headlights. So here’s the system as it exists today: an intergalactic-weatherary* API communicates with the OpenWeatherMap API for cities on Earth, and with the planetary-api.
Station API, the last frontier
The component I’m adding here is pretty simple: it’s a GET endpoint that returns a random string in a JSON response to represent the weather on any given planet. I’m deploying this in Google Cloud Functions (GCF) using the Serverless framework once again; I followed the Quick Start guide and that got me most of the way there.
npm install serverless-google-cloudfunctions
serverless create --template google-nodejs --path station-mars
cd station-mars
npm install
sls deploy
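The handler itself is tiny. Here is a rough sketch of what mine ends up looking like after swapping the template’s hello-world response for a random weather string; the exported name and most of the strings are made up for illustration (only the biodome one shows up in the real output later on).
// index.js (sketch): the Mars station-api is a GET endpoint that returns a
// random string in a JSON response to represent the weather on the planet.
const conditions = [
  'better stay inside the biodome today',
  'dust storm rolling in',        // made up for the example
  'clear skies, still freezing',  // made up for the example
];

// GCF HTTP functions get Express-style request/response objects.
exports.station = (request, response) => {
  const weather = conditions[Math.floor(Math.random() * conditions.length)];
  response.status(200).json({ weather: weather });
};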
I took the generated sample code and added the tracing code from my previous example. Then I ran into a couple of gotchas. First, the version of serverless-google-cloudfunctions that npm install put on my laptop was 1.2.0, and that version didn’t support nodejs8, which is currently in beta in Google Cloud Functions and which I needed if I wanted to use Honeycomb’s nodejs beeline. Easy enough to update the package with the following line.
npm install serverless-google-cloudfunctions@2.0.0
Once that was sorted out, I ran into a different issue: the environment variables I configured in my serverless.yml file didn’t appear in my GCF dashboard. A quick search of the serverless-google-cloudfunctions GitHub repository pointed me to this issue and to this patch, which I applied locally to get my Mars station-api where I wanted it.
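With both gotchas sorted, the tracing side looks roughly like this. This is a sketch rather than the exact code from my previous article: the dataset and service names are made up, wrapping the invocation in startTrace/finishTrace is just one way to use the beeline, and the write key comes through the environment variable that the patch above finally surfaced.
// index.js (sketch): the same handler with the Honeycomb nodejs beeline wired in.
// The beeline has to be required and configured before anything it should instrument.
const beeline = require('honeycomb-beeline')({
  writeKey: process.env.HONEYCOMB_WRITE_KEY, // set in serverless.yml, see the patch above
  dataset: 'intergalactic-weatherary',       // hypothetical dataset name
  serviceName: 'station-mars',
});

const conditions = ['better stay inside the biodome today', 'dust storm rolling in'];

exports.station = (request, response) => {
  const trace = beeline.startTrace({ name: 'station-mars' });
  const weather = conditions[Math.floor(Math.random() * conditions.length)];
  beeline.addContext({ 'weather.report': weather }); // shows up as a field on the span
  response.status(200).json({ weather: weather });
  beeline.finishTrace(trace);
};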
I used the code I wrote in my previous article to deploy my Mercury station-api in Azure.
Debugging
With this last component in place, I’m finally ready to ask the intergalactic-weatherary about the weather around the galaxy. First, I asked it to tell me about the weather in a city or two; so far so good. Next, let’s see what the weather is like on Mercury…
sls invoke -f weatherary -d '{"city":"chicago"}'
{
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"city\":\"chicago\",\"weather\":\"clear sky\"}"
}
sls invoke -f weatherary -d '{"planet":"mercury"}'
{
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"planet\":\"mercury\",\"weather\":\"\"}"
}
And of course, my application is busted. What the f… Thankfully, the code is pretty well instrumented at this point. Let’s take a look at what tracing can show me.
It looks like the code never reaches the planetary-api. This is pretty awesome: within a few seconds, I’m able to trace my application and narrow the part of the system that needs troubleshooting down to a single function. Thankfully, the code in the getPlanetaryWeather function is pretty small; it just makes a request upstream. Let’s add some more information to the trace, re-deploy, and now I have everything I need to fix the problem. Taking a closer look at the additional information in the trace, I found that the upstream service was returning a 403, and looking at the URL, it’s clear that something’s not quite right here.
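I won’t reproduce the weatherary’s actual instrumentation here, but “adding some more information” really just means attaching the upstream URL and the response status code to the active span so they show up in the trace. Sketched with the nodejs beeline I used in the station code (the field names, helper shape, and environment variable are mine, for illustration only), it amounts to something like this:
// getPlanetaryWeather (sketch): record the exact URL being called and the status
// code that comes back, so the trace can answer "which URL?" and "which status?".
const beeline = require('honeycomb-beeline')(); // already configured at startup
const rp = require('request-promise');

async function getPlanetaryWeather(planet) {
  const url = `${process.env.PLANETARY_API_URL}/${planet}`; // hypothetical env var
  beeline.addContext({ 'planetary.url': url });

  const response = await rp({ uri: url, resolveWithFullResponse: true, simple: false });
  beeline.addContext({ 'planetary.status_code': response.statusCode });

  return response.body;
}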
sls invoke -f weatherary -d '{"planet":"mercury"}'
{
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"planet\":\"mercury\",\"weather\":\"403 - {\\\"message\\\":\\\"Missing Authentication Token\\\"}\"}"
}
I updated the upstream URL and, ta-daa, the bug is fixed! It looks like the weather on Mercury is “stupidly sunny”; who would have thought. Let’s see what’s happening on Mars: “invalid character ‘E’ looking for beginning of value”. That doesn’t look right.
sls invoke -f weatherary -d '{"planet":"mercury"}'
{
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"planet\":\"mercury\",\"weather\":\"stupidily sunny\"}"
}
sls invoke -f weatherary -d '{"planet":"mars"}'
{
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"planet\":\"mars\",\"weather\":\"invalid character 'E' looking for beginning of value\"}"
}
A little more troubleshooting later, I fixed a couple of missing semicolons in my code, and the Mars station-api is back in business.
sls invoke -f weatherary -d '{"planet":"mars"}'
{
    "statusCode": 200,
    "headers": {
        "Content-Type": "application/json"
    },
    "body": "{\"planet\":\"mars\",\"weather\":\"better stay inside the biodome today\"}"
}
Deployments
Once the code is launched in production, the next critical piece of information needed to get visibility into the health of my distributed system is knowing when deployments are going out. I’ve used Grafana annotations for this in the past; Honeycomb exposes a Markers API that offers similar functionality, basically letting you send a bit of metadata that gets overlaid on top of your dataset. I created a simple script to deploy my services; in a production environment I would be using pre/post deploy hooks for this, but my application is not that fancy yet. This bash script sends a request to the Markers API to mark the start and end time of my deployment and calls the serverless CLI to do the actual deployment. I ran some traffic through my service and deployed a change that visibly made things worse; the graph below shows the max duration for requests through my system.
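For reference, here is roughly what that deploy script boils down to. The real thing is a few lines of bash; this Node-flavored sketch is just to show the shape of a Markers call, with a hypothetical dataset name and the write key read from an environment variable.
// deploy.js (sketch): post a marker, run the deployment, post another marker.
const https = require('https');
const { execSync } = require('child_process');

function postMarker(message) {
  // Markers API: POST /1/markers/<dataset>, authenticated with the team write key.
  // The timestamp is captured here, so the marker lands at the right time even if
  // the request only flushes once the event loop is free again.
  const body = JSON.stringify({
    message: message,
    type: 'deploy',
    start_time: Math.floor(Date.now() / 1000), // Unix time, in seconds
  });
  const req = https.request({
    hostname: 'api.honeycomb.io',
    path: '/1/markers/intergalactic-weatherary', // hypothetical dataset name
    method: 'POST',
    headers: { 'X-Honeycomb-Team': process.env.HONEYCOMB_WRITE_KEY },
  });
  req.end(body);
}

postMarker('deploy started');
execSync('sls deploy', { stdio: 'inherit' }); // the actual deployment
postMarker('deploy finished');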
I tried to deploy a few fixes to the bug without any success and eventually leveraged the serverless rollback functionality. The client returns a handy list of the last five deployments by default.
sls rollback
Serverless: Use a timestamp from the deploy list below to rollback to a specific version.
Run `sls rollback -t YourTimeStampHere`
Serverless: Listing deployments:
Serverless: -------------
Serverless: Timestamp: 1539757835779
Serverless: Datetime: 2018-10-17T06:30:35.779Z
Serverless: Files:
Serverless: - compiled-cloudformation-template.json
Serverless: -------------
Serverless: Timestamp: 1539757936388
Serverless: Datetime: 2018-10-17T06:32:16.388Z
Serverless: Files:
Serverless: - compiled-cloudformation-template.json
Conclusion
Although tracing and markers are only some of the components that make up observability, taking them into consideration at the early stages of the development process can save oodles of time in the long run. Tracing gives me the tooling necessary to debug an application running in production without having to sift through millions of lines of code and logs from different applications, and it allowed me to quickly zero in on the problem areas. Marking deployments and rollbacks makes it easy to see the impact of a change to the system at a glance. This information can then be used to alert on unexpected changes in patterns and to validate theories as code is deployed.
One of the challenges with visibility that I ran into and didn’t get a chance to address was hooking other cloud provider services (application load balancers, databases, logging…) into my Honeycomb dataset to test out the integration with those services. Having multiple sources of information is definitely a challenge we face in production at my day job, and being able to tie all that information together in a timely manner is critical. As I’ve said before, this is something a provider-agnostic tool can really help with.
Observability-Driven Development (ODD) has the potential to do for DevOps what Test-Driven Development (TDD) did for software development years ago. One of the great outcomes of TDD is that it drove developers to ask the question “how would I test this?”. The process of writing tests before writing code, and then filling in the code to make those tests pass, had many positive outcomes: simpler code, fewer regressions, and more maintainable applications. Hopefully the outcome of ODD is that it will force everyone to ask the question “how would I observe this?”
* NOTE: Intergalactic-weatherary should be read intergalactic-weather-A-REEEEE; see the following video for the inspiration behind the reference.