Distributed applications come in all shapes and sizes. In many cases, one Application Programming Interface (API) talks to another and distributed tracing through Hypertext Transfer Protocol (HTTP) headers works just great. But what about other cases? Can distributed tracing work for event-driven architectures? What if, instead of a direct request, a message queue is used to trigger an action in another part of the system?
A little while ago I started working with Redis channels to pass information from one service to another. If you haven’t played with Redis Pub/Sub, it’s a pretty simple and convenient way to send notifications from a service to N subscribers through a channel. I started digging into how I could use it as the bus for the tracing information as well.
Some background on distributed tracing
For those of you who aren’t familiar with the basic concepts of distributed tracing, here’s the quick rundown.
A trace represents an event or a transaction through a system. Each trace has a unique trace ID. When I think of traces, I think of what would matter the most to a user of my system. What is the thing that my application does? What impacts the value of what my system is delivering?
A span represents a unit of work: a function call or a task that your system performs. Spans are tied to a trace via the trace ID, and each span has its own unique span ID. Spans can also be related to other spans through a parent span ID, which is used when displaying the tracing information to order the spans into a hierarchy. In most cases, the span also collects timing information.
The trace context is the metadata associated with the current span and trace. This context is the data that must be passed between components in order to establish continuity in the trace. A context includes the relevant IDs as well as sampling information.
A quick note on some of the tools I’m using in building this example that you may not have heard of before:
- Redis: An open-source key-value store that provides a Pub/Sub interface, which my application uses as a message bus.
- OpenCensus: An open-source set of libraries that I will use in my application to send metrics and tracing information to a collector.
- Jaeger: An open-source project that provides an OpenCensus compatible exporter, a collector for my tracing information and a powerful user interface to display the distributed traces.
In this example, my application has an API that receives POST requests to generate reports. These reports take some time to generate, so it needs to be done asynchronously. Once a request has been received, the API publishes a message to Redis. Workers that will generate the report receive the message by subscribing to the channel, generate the report and then make it available to the user. In order to capture a transaction through this system, we’ll need to record everything from the initial request all the way through to the last report generated.
I’ve instrumented the application using OpenCensus. I chose it because I wanted to use the handy Redis library that already exists for it. The first step was to ensure I could get each component sending tracing information separately. Instrumenting the API was as easy as using the ochttp handler wrapper and the redigo Redis library:
I rebuilt and deployed the code, then popped open a browser and took a look at the Jaeger interface running on http://localhost:16686:
Not quite as useful as it could be, but at least the basics of getting data out of my services were functional. Now for the fun bit. Distributed tracing is really not much more complicated than passing some metadata between services to tie the context together across components. This concept is called propagation.
In order to propagate the context, I need to put the trace information into the message I publish to the channel. I implemented a simple wrapper around the OpenCensus Binary propagator, which I’ll call whenever I publish or receive a message. The code is pretty simple and appends the serialized trace context to the end of the message’s byte slice:
Using it looks like this:
Now we can see the transaction across all the components of the application.
And that’s really it! It’s now possible to see the duration of all the operations of the system. The code is all available on GitHub. A docker-compose file spins up the API, a few workers along with a container running Redis. There’s also a container running Jaeger to collect and render the tracing information. Feel free to reach out if you have questions or comments!