There’s a scene in the Simpsons episode “Lisa the Simpson” where Lisa appears on a television broadcast to make an appeal to the audience. She urges citizens to use and treasure their brains, fearing that she is about to lose her own to the “Simpson gene”. As I was leaving my previous employer, I had a similar fear that my brain would turn to mush. I’m writing this article in case my true destiny of turning into a couch potato is fulfilled and I no longer have full use of my brain. The advice below is a little all over the map, but all of it has helped teams I’ve worked with be more effective.
Treat your Continuous Integration/Continuous Delivery (CICD) tooling like production
When starting a project, it’s easy to get caught up in building software and getting it deployed somewhere. Often in that excitement, someone says “Let’s get this into CICD”, at which point someone else volunteers to set up the tool they already know somewhere (e.g. Jenkins, Concourse, Drone). As soon as the software is building and deploying, boom, let’s move on to the next thing. This is a key point in the project. It’s easy to say, in the name of building more features, “this works for now, it’s fine”. However, this is also the moment we’ll look back to on the day something important MUST be shipped and we can’t, because Jenkins is caught in an unrecoverable state, or the Concourse workers have died and, for some inexplicable reason, restarting them doesn’t fix the problem. There’s no need to spend days or weeks making CICD completely bulletproof, but some simple steps can ensure even a catastrophic failure doesn’t cause the development team pain down the road.
One of the easiest ways to make something recoverable, and to ensure it passes the bus factor test, is to check it in! Whether that means checking in the host’s configuration or creating images that make hosts easy to redeploy, there’s more than one way to accomplish this.
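As a sketch of what “check it in” can look like for a Jenkins host, the script below pulls each job’s `config.xml` through Jenkins’ REST API so the definitions can be committed to source control. The host URL and job names here are placeholders, and this assumes anonymous read access (a real setup would add credentials).

```python
import urllib.request
from pathlib import Path

def job_config_url(base_url: str, job_name: str) -> str:
    """Build the Jenkins REST endpoint that serves a job's XML definition."""
    return f"{base_url.rstrip('/')}/job/{job_name}/config.xml"

def backup_jobs(base_url: str, job_names, dest_dir: str) -> None:
    """Fetch each job's config.xml and write it to a directory that
    lives inside a git repository, ready to be committed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in job_names:
        with urllib.request.urlopen(job_config_url(base_url, name)) as resp:
            (dest / f"{name}.xml").write_bytes(resp.read())
```

Run on a schedule (or as a Jenkins job itself), this turns “how was that pipeline configured?” into a `git log` question instead of an archaeology project.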
No one looks at logs, that is, until something goes wrong. So it’s problematic when those logs are no longer accessible. The easy solution here is to ship the logs somewhere, anywhere. Use a centralized syslog system or some third-party service. Ideally, the logs are shipped somewhere that makes monitoring and alerting on them easy.
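In Python, for example, the standard library’s `SysLogHandler` is enough to start shipping logs off-host; the collector address below is a placeholder for whatever centralized syslog endpoint is in use.

```python
import logging
import logging.handlers

def make_shipped_logger(name: str, collector=("127.0.0.1", 514)) -> logging.Logger:
    """Return a logger whose records are sent to a remote syslog
    collector over UDP, in addition to anything logged locally."""
    logger = logging.getLogger(name)
    handler = logging.handlers.SysLogHandler(address=collector)
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

The point isn’t this exact snippet; it’s that shipping logs off the box is a few lines of setup, not a project.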
Use your own discretion here, but some advance warning when things are about to go BOOM is nice. We used to alert on Jenkins 24/7, and most of the time, those alerts could have waited until regular business hours. Some of the metrics we monitored were the usual suspects: disk, memory, CPU. Another metric that’s handy to watch is the build time of known components.
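A build-time check doesn’t need to be fancy. Here’s a sketch that flags the latest build when it runs well beyond what recent history suggests; the three-sigma threshold is illustrative, not a recommendation.

```python
from statistics import mean, stdev

def build_time_alert(recent_seconds, latest_seconds, sigmas=3.0):
    """Return True when the latest build took more than `sigmas`
    standard deviations longer than the recent builds suggest."""
    if len(recent_seconds) < 2:
        return False  # not enough history to judge
    threshold = mean(recent_seconds) + sigmas * stdev(recent_seconds)
    return latest_seconds > threshold
```

For example, `build_time_alert([300, 310, 295, 305], 900)` fires, while a latest build of 315 seconds does not.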
document setup and recovery procedure
Any time I join a team, I get to be the annoying person who asks where things are documented. Often, the thing that’s least documented is how someone put together the deployment pipeline and what to do when things go wrong with it. Document it, and test out the documentation.
apply chaos engineering practices
If you’re not familiar with chaos engineering, here’s a great website to get you started. In short, it means breaking systems intentionally to figure out how to make them more resilient. I’ve worked in places that host break-a-thons or fire drills to achieve similar goals. In the case of a CICD system, I would use chaos engineering principles to test that the recovery procedure works and that no significant data loss occurs when the system fails.
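A fire drill for a CICD system can be as simple as picking a failure scenario at random and timing the team’s recovery against the documented procedure. A minimal sketch of that idea (the scenarios listed are examples, not an exhaustive catalogue):

```python
import random
import time

SCENARIOS = [
    "primary CI host is terminated",
    "artifact store is unreachable",
    "worker pool is drained",
    "pipeline configuration is deleted",
]

def pick_drill(seed=None) -> str:
    """Choose the failure to inject for this drill run."""
    rng = random.Random(seed)
    return rng.choice(SCENARIOS)

def run_drill(recover, seed=None):
    """Inject a scenario and measure how long `recover` (a callable
    that follows the documented runbook) takes to restore service."""
    scenario = pick_drill(seed)
    start = time.monotonic()
    recover(scenario)
    return scenario, time.monotonic() - start
```

If the runbook can’t be followed, or the recovery time is measured in days, the drill did its job: it found the problem before a real outage did.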
backup and store artifacts separately
Losing the CICD system is bad. Losing all the artifacts the company has shipped, because there was only one copy and it lived on the CICD host, can be disastrous. Yes, this has happened to me, and no, it wasn’t all that long ago. It was not a good week, but thankfully it was early enough in the project that we hadn’t shipped anything yet. An easy option for storing artifacts is to ship them to an online storage solution like Amazon’s S3. Concourse provides a custom resource for storing things in S3, and Jenkins provides this functionality via a plugin.
don’t maintain it
Unless an organization’s bread and butter is providing a CICD solution, the build pipeline is likely just another required tool. Don’t waste engineering time on it; if it’s in the budget, use a hosted solution.
Emit events when deploying builds
Tying deployments to changes in the code is awesome. I like to hypothesize about the impact of a change, and one of my favourite feelings is the sense of accomplishment that comes from seeing it deployed and observing its impact on a dashboard. Being able to correlate changes in a system’s behaviour with deployments is super duper helpful when troubleshooting what went wrong. Key information I’ve found useful in the past:
- commit and link to change in source control
- link to build logs
- start and end of deployment
- phase, if applicable
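That list translates naturally into a small, structured event payload. Here’s a sketch of what a deploy event might carry; the field names and URLs are illustrative, not a standard.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DeployEvent:
    """The deployment facts worth correlating against dashboards."""
    service: str
    commit: str          # SHA, with a link back to source control
    commit_url: str
    build_log_url: str
    started_at: float    # unix timestamps marking the deploy window
    ended_at: float
    phase: str = ""      # e.g. "canary" or "full", if applicable

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Emitting this as an annotation to whatever monitoring system draws the dashboards is what makes “did the graph change when we deployed?” answerable at a glance.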
I’m not about to preach about test coverage, or unit testing all the things everywhere all the time. I know that’s not practical in all cases, and in some cases, the value the unit tests add would be close to none (e.g. when the test mocks 95% of the functionality and the rest of the code is just a shim around what’s being mocked). But writing tests when starting to tackle a piece of logic that is tricky, or even a piece of logic that appears trivial but requires a fair amount of code, almost always pays off. Not only does it provide coverage for regressions and more safety for other engineers, but it also feels great. I don’t know, when I write tests and finish shipping the code, it just makes me feel like I’ve done my job right. When writing tests, here are a couple of things to keep in mind.
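As an example of the “looks trivial but requires a fair amount of code” case, here’s a hypothetical duration-formatting helper and a plain test that pins down its edge cases. Both the helper and the test class are made up for illustration.

```python
import unittest

def format_duration(seconds: int) -> str:
    """Render a build duration like '1h 02m 03s'; the kind of small
    helper that looks trivial but hides edge cases."""
    if seconds < 0:
        raise ValueError("duration cannot be negative")
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}h {minutes:02d}m {secs:02d}s"
    if minutes:
        return f"{minutes}m {secs:02d}s"
    return f"{secs}s"

class FormatDurationTest(unittest.TestCase):
    def test_edge_cases(self):
        self.assertEqual(format_duration(0), "0s")
        self.assertEqual(format_duration(59), "59s")
        self.assertEqual(format_duration(60), "1m 00s")
        self.assertEqual(format_duration(3723), "1h 02m 03s")
        with self.assertRaises(ValueError):
            format_duration(-1)
```

Nothing beyond the standard library’s `unittest` is needed here, which ties into the next point.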
make it easy
There’s a ton of tools and frameworks out there providing additional functionality for folks writing unit tests. My general rule of thumb here is that tools used in unit tests should not make writing and running tests harder for folks on the team. Adding barriers to writing tests will ultimately drive developers away from writing them. It’s important to keep things as simple as possible.
test what makes sense
A lot has been written about code coverage and about judging the quality of a codebase by that coverage. I’m a big fan of testing what makes sense in a codebase. It’s easy to get a false sense of security based on the amount of code that’s covered. It’s also easy to fall into the trap of writing so many tests that changing a line of code means modifying dozens of tests in a more or less meaningless way.
Pair with people
Take the time to pair with people. No matter how important delivering a particular feature seems, ensuring others are empowered to take over any given task at any time is precious. This works both for the person learning and the person offloading their knowledge. In my last few weeks at my previous job, I took every opportunity I had to ensure everyone around me would know what I was working on. I’ll admit, sometimes it felt like uncovering a pile of dirt I had swept under the rug, but somehow, I still felt better after doing it.
it feels scary
It’s important to acknowledge that not everyone feels comfortable pairing. Different personalities, experience levels and biases from previous experiences with pairing all play a role. Even after years of pairing, I occasionally feel a spike in adrenaline when someone asks me to pair. Imposter syndrome is real, and there’s always a chance of being found out when pairing.
Pairing can be quite intense. Writing code while continuously working with someone else on what is next can be quite taxing. It’s important to remember to take breaks, and to limit the amount of pairing done daily.
Tooling can be a real barrier when pairing. Pairing on someone’s laptop usually means the person whose laptop is being used does all the driving, or that the other person keeps asking how to do things. This leads to mounting frustration and eventual abandonment of the process. The best setup I’ve ever paired in included dedicated pairing stations in the office, separate from individual workstations. Each pairing station was set up identically, with an image that was used to wipe and reset the stations regularly. This meant no one had any attachment to a particular pairing station, and all the tools in use were installed as part of the base image. Another good way to pair has been through shared tmux sessions or Visual Studio Code’s Live Share extension, which allows everyone pairing to collaborate equally. A last-resort method is screen sharing through video conferencing software; the downside of that approach is that one person drives while the other is only passively involved.
Most of what I covered here can be experimented with cheaply in an organization. Start with a couple of hours on a Friday and experiment, seeing what works and what doesn’t. If something doesn’t work after you’ve given it an honest try, throw it away, and send me some feedback! I would love to learn from your experiences.