med ∩ ml

Musings about automation

Engineers like automating tasks. You find yourself running a set of commands, copying values between files and prompts, and you think: “I can automate this!”

Jokes aside (“why spend 30 minutes doing something when you can spend 2 days writing a program to automate it?”), I believe we don’t think enough about automation. Here I want to share some ideas.

Broken automation is more painful than manual work

Automation is wonderful until it breaks. When that happens, the fix may be easy, but having the automation in place may also have made you forget how the task is actually done. Now you have to spend time reading code and understanding all the parts of the process. This may take a few minutes or a few days.

When an automation breaks, you can end up not knowing what state your application/infrastructure/server/etc. is in. If you’re doing the steps manually, you have better context of what has actually happened: completed tasks, environment variables set, files written, running background processes. The more moving parts, the harder it is to debug a broken automation. On the other hand, more moving parts means a higher chance of human error, which also means it’s worth automating.

Recovering from errors is also relevant. When an automation process breaks, you may find yourself in one of these situations:

  • The process automatically undoes the steps taken. This can be good to avoid leaving a system in a broken/intermediate state. Fixing the error requires human time (probably more than doing the task manually); hopefully the fix avoids spending more time in the future. But the error may not be reproducible: testing systems, even simple ones, is very hard. Maybe a recently created server didn’t respond to SSH commands for a few seconds. You can implement some tentative fixes, but what happens if the error is not reproducible? Perhaps it only happens under circumstances out of your control.

  • The process communicates what went wrong, without doing any extra steps. Good error messages are underrated. If the process can communicate exactly what went wrong and at which step, a human operator may be able to resume the execution from that point. This could also help uncover the error (if it’s reproducible). For this to happen, the automation process has to be able to start from a specific step, rather than being “all or nothing” (as usually happens).

  • The process just breaks. Well, now you have to understand the process, read code, check the current state of the system, try to uncover the bug, implement a fix and retry.
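To make the second situation more concrete, here is a minimal sketch of a process that can start from a specific step instead of being “all or nothing”. The names `run_steps`, `StepFailed` and `start_at` are made up for this example, and the steps are plain Python callables; a real process would probably also need to persist some state between runs.

```python
from typing import Callable, Sequence, Tuple

# A step is a human-readable name plus the function that performs it.
Step = Tuple[str, Callable[[], None]]


class StepFailed(Exception):
    """Raised when a step fails; remembers which step to resume from."""

    def __init__(self, index: int, name: str, cause: Exception):
        super().__init__(f"step {index} ({name}) failed: {cause}")
        self.index = index
        self.name = name


def run_steps(steps: Sequence[Step], start_at: int = 1) -> None:
    """Run the steps in order, skipping everything before `start_at` (1-based)."""
    for index, (name, action) in enumerate(steps, start=1):
        if index < start_at:
            print(f"[skip] {index}. {name}")
            continue
        print(f"[start] {index}. {name}")
        try:
            action()
        except Exception as exc:
            # Say what broke and how to pick up from here, then stop.
            print(f"[error] {index}. {name}: {exc}")
            print(f"        fix the problem, then resume with start_at={index}")
            raise StepFailed(index, name, exc) from exc
        print(f"[done] {index}. {name}")
```

If step 2 of 3 fails, the operator can fix the cause and call `run_steps(steps, start_at=2)`; step 1 is skipped rather than repeated. This only works if the steps before `start_at` left the system in the state the later steps expect, which is an assumption the sketch does not check.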

Full-blown automation can be fragile

You are trying to automate a process, and you automate it end-to-end. You are handling all the edge cases, doing all the necessary checks, doing error recovery. The process is probably complex at this point. But now the task changes: someone else needs the process to be different. You can either add even more code to an already-complex process, or revert some (if not all) of the work you’ve done. When automating processes, it’s easy to make assumptions about the system and the future requirements (or lack thereof).

This is similar to adhering too strictly to the DRY principle. A 100% correct process can become bloated, rigid, and fragile.

Really solid automation is notoriously expensive to set up, and about as expensive to change when the circumstances fundamentally change [source].

Automate pain, not problems

We tend to jump to automation as soon as we find an issue.

Imagine a process which:

  • Has 5 steps, all of them manual.

  • Step 4 takes 3 days to complete; the other steps are quick.

  • On a couple of occasions, the process broke during step 4, forcing you to restart it. Maybe it broke after running for 36 hours, and those 36 hours are now lost.

  • Following the interruption, you need to either check which parts of step 4 had finished and try restarting from that point, or remove everything and start again from step 1.

This is a problem.

Now imagine a different process. It has 20 steps, most of them quick. Each step requires copying and pasting content, checking different sources to ensure values match, running commands based on the changed content, etc. This is pain.

Let’s say both examples run every week. The first one is simply a problem we hit a couple of times; the second one is error-prone and painful to do manually (which doesn’t mean it’s difficult). Try automating the second one, even if it’s just part of it (because full-blown automation is hard). If the first example only breaks a couple of times per year, I would be happy to re-run it from scratch when that happens.

Improving automation

Some random ideas for better automation.

Good logging and error messages

This is key. Tell your users what’s going on during the process, give all the useful feedback you can. Include at least these 3 types of messages:

  1. Before a step is about to begin. Example: generating ssh key for new user. The message can then be iteratively improved:
     • Add simple visual hints: [start] generating ssh key for new user
     • Add information relevant to the user: [start] generating ssh key with name 'id_ed25519' for new user 'admin' at ~/.ssh/
  2. When the action has finished. Using the example above: [done] generated ssh key ~/.ssh/id_ed25519 for user 'admin'
  3. When something breaks. These messages should include the step that broke and the error message (even if it comes from a subprocess). If possible, suggest some solutions based on the most common mistakes, or other ways to proceed. Example:
[error] couldn't generate new ssh key for user 'admin'
	open /home/admin/.ssh/id_ed25519 failed: Permission denied.

Make sure the folder '~/.ssh' has the correct permissions (chmod 0700) and
the user 'admin' is the owner (chown -R admin:admin).

If the issue persists, you can ask for help in the #tech-help Slack channel
or check the document: 'https://www.notion.so/...'

Depending on the task, error, context, etc., you may be able to include better information in the error messages. Just try to make them as informative as possible.
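The three message types can be wrapped around any step with very little code. This is only a sketch: `StepMessages` and `run_logged` are names invented for this example, and the `log` parameter defaults to `print` just to keep it short (a real script might use the `logging` module instead).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepMessages:
    """The texts shown to the user for one step."""

    start: str      # e.g. "generating ssh key for new user 'admin'"
    done: str       # e.g. "generated ssh key ~/.ssh/id_ed25519 for user 'admin'"
    error: str      # e.g. "couldn't generate new ssh key for user 'admin'"
    hint: str = ""  # optional: common fixes, where to ask for help


def run_logged(messages: StepMessages, action: Callable[[], None],
               log: Callable[[str], None] = print) -> None:
    """Run `action`, reporting when it starts, finishes or breaks."""
    log(f"[start] {messages.start}")
    try:
        action()
    except Exception as exc:
        # Show which step broke, the underlying error, and the hint (if any).
        lines = [f"[error] {messages.error}", f"\t{exc}"]
        if messages.hint:
            lines += ["", messages.hint]
        log("\n".join(lines))
        raise  # let the caller decide whether to stop or recover
    log(f"[done] {messages.done}")
```

Keeping the messages next to the step makes them easy to improve iteratively: when the same question shows up twice, it goes into `hint`.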

Review the processes and keep improving them

Even if an automation process works, it’s worth reviewing it periodically. You may find simpler ways to do something, broken links, outdated comments, etc. Small, regular reviews reduce the pain inflicted when the process breaks.

Pareto automation

As we said, solid and complete automation is hard to do right. Automating 80% of the steps can be quicker to implement (20% of the effort) and simpler to debug (20% of the time).

Some examples:

  • You need to set up multi-user VMs. This includes installing packages, configuring security and copying keys with the correct names and permissions. 80% of the tasks can probably be done with a simple Ansible playbook. Then you can manually copy the keys from your secrets manager and rsync them to the new VM, making sure they are in the correct location and have the correct name and permissions. A checklist can be really helpful. The alternative may include: setting up an integration between your secrets manager and your script; learning about ansible-vault to avoid publishing unencrypted keys (maybe we are running this from another automated process); learning how to copy files owned by a different user (in the remote), with a different name and with custom permissions in Ansible; adding git hooks/actions to check that no unencrypted keys are committed, etc.

It’s possible to do everything, but the first 20% of the effort is a good starting point. It may turn out to be good enough, removing the need to write, maintain, document and debug more code.

  • Our previous example (a manual process involving 5 steps, one of which takes 3 days; the process breaks a couple of times per year). The 20% (effort) automation may be just setting up a script to run all the steps sequentially. If it breaks, you just re-run the script and save yourself from re-setting the necessary environment variables or config files. The other 80% of the effort could involve complicated retry logic based on outputs, code version, database versions, etc. This would not only be more fragile and harder to write; having to re-run the script twice a year may simply not be that bad.
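For the second example, the “20% of the effort” script could be as small as the sketch below. It assumes each step is an external command and that the shared configuration lives in environment variables; `run_all` is a made-up name, and there is deliberately no retry logic.

```python
import os
import subprocess
import sys
from typing import Mapping, Sequence


def run_all(commands: Sequence[Sequence[str]], env: Mapping[str, str]) -> None:
    """Run each command in order with the same environment.

    Stops at the first failure: subprocess.run(..., check=True) raises
    CalledProcessError when a command exits with a non-zero status.
    """
    # Start from the current environment and add the process-specific values,
    # so nobody has to remember to export them by hand before each run.
    full_env = {**os.environ, **env}
    for number, command in enumerate(commands, start=1):
        print(f"[start] step {number}: {' '.join(command)}")
        subprocess.run(command, env=full_env, check=True)
        print(f"[done] step {number}")
```

When a step fails, the whole script stops and you re-run it from the beginning once the cause is fixed. That is wasteful for a 3-day step, but it is a handful of lines to maintain instead of a retry system.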

Do-nothing scripting

Probably underrated. It seems like a good entry point to incremental automation of processes.
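As I understand the idea, a do-nothing script doesn’t perform any step itself: it just walks the operator through the manual procedure, one step at a time, and waits for confirmation. A minimal sketch could look like this (the function name and the `wait`/`log` parameters are my own choices for the example):

```python
from typing import Callable, Sequence


def do_nothing_script(instructions: Sequence[str],
                      wait: Callable[[str], str] = input,
                      log: Callable[[str], None] = print) -> None:
    """Show each manual step and wait until the operator confirms it."""
    total = len(instructions)
    for number, text in enumerate(instructions, start=1):
        log(f"Step {number}/{total}: {text}")
        wait("Press Enter when done... ")
    log("All steps done.")
```

For the VM example above, the instructions could be: “Copy the keys from your secrets manager”, “rsync them to the new VM”, “Check that they have the correct location, name and permissions”. Each of those strings can later be replaced by code that actually does the step, one at a time, which is what makes it an entry point to incremental automation.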