Deploying software is a sensitive process. It’s sensitive because we’re basically taking a working version of our application and replacing it with a new one. Yes, your team probably thoroughly tested the new version, but the doubt of whether it works or not always lingers. This post is about six common deployment risks and how to mitigate them so you’ll have more confidence when you deploy new code.
The Deployment Process and Risks
As stated above, the deployment process is inherently risky. We’re changing a working, stable version of our application to something we don’t know will work. This can be intimidating, even if you’re a developer working and deploying code by yourself. Working in a team adds even more variability, because it involves different stakeholders, like project managers, product managers, QA engineers, developers, and DevOps engineers.
The deployment process in the latter scenario involves many people, procedures, and protocols. The potential for things to go wrong increases with every additional individual and every step added to the process. The consequences of a botched deployment can range from a bad customer experience (unresponsive GUI) to malfunctioning features (why can’t I log out?) to a total disruption of service (the whole system is down). Thus, you have to mitigate the deployment risks as much as possible. No one likes to get angry calls from customers.
Deploying to the Wrong Environment
Yes. This is as silly and problematic as it sounds. It seems to be self-evident that you need to deploy the new version to the development environment first, test it, check it, and—only after everything is OK—deploy it to staging and production environments. However, accidents happen. It’s an unpleasant experience for all involved to deploy an untested version to production instead of development. Sometimes the new version contains an unstable GUI (elements popping from different places on the screen) or even an unusable one (no elements are present on the screen).
When Using Caps Is Actually Appropriate
To mitigate this risk, I suggest moving the deploy-to-production button to its own deployment server, such as Jenkins or whatever software you use, so it sits apart from the buttons for the other environments. In bold and capital letters, put “THIS IS DEPLOYMENT TO PRODUCTION” above it. This might sound ridiculous, but it isn’t. Caps in this scenario are appropriate and do the work.
Make clear who has the authority to push the button. It’s best if you restrict this ability to a handful of people: the DevOps engineer, software development team leader, and others who know what to do if something goes wrong.
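If your deployment runs from a script rather than a button, the same two ideas (a loud confirmation and a short list of authorized people) can be sketched in a few lines. This is only an illustration: the role names in PRODUCTION_DEPLOYERS and the confirm_deploy function are hypothetical and not part of Jenkins or any other tool.

```python
# Hypothetical allowlist: the handful of people allowed to deploy to production.
PRODUCTION_DEPLOYERS = {"devops-engineer", "team-lead"}


def confirm_deploy(user: str, environment: str, typed_confirmation: str = "") -> bool:
    """Return True only if this user may deploy to this environment."""
    if environment != "production":
        # In this sketch, development and staging are open to everyone.
        return True
    if user not in PRODUCTION_DEPLOYERS:
        return False
    # The operator must type the environment name in capitals,
    # which makes an accidental production deploy much harder.
    return typed_confirmation == "PRODUCTION"
```

A deploy script would call confirm_deploy before doing anything else and stop immediately if it returns False.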
Now let’s assume that you have deployed to production correctly, but the features themselves are not yet ready. In this case, you’ll need to do a quick rollback. Setting a clear protocol for deploying to production mitigates this risk. The protocol should include the features’ definition of done, manual QA, documentation, and every other check that the code should pass before it can be marked as safe to deploy to production. This checklist ensures that only complete features and working code will be deployed to production. Aside: don’t be afraid of checklists; doctors and pilots use them to ensure complex processes are followed completely. They make a big difference.
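A checklist like this can also live in code, so the pipeline refuses to continue while any item is open. The sketch below assumes only the three checks named above; the ReleaseChecklist class and its field names are made up for illustration, and your own protocol will likely have more items.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ReleaseChecklist:
    """Hypothetical checklist; extend it with your organization's own checks."""
    definition_of_done_met: bool
    manual_qa_passed: bool
    documentation_updated: bool


def unmet_items(checklist: ReleaseChecklist) -> List[str]:
    """List the names of all checks that have not been ticked off."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def safe_to_deploy(checklist: ReleaseChecklist) -> bool:
    """A release is safe to deploy only when every item is complete."""
    return not unmet_items(checklist)
```

Reporting the unmet items by name, rather than just returning False, tells the team exactly what still blocks the release.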
Deploying the Wrong Code
Yes, this can happen as well. Sometimes a developer deploys the wrong Git branch of the repository. To mitigate this risk, I suggest defining which branches are deployed to production and whether you use GitHub flow or GitFlow. The rest of the branches should be automatically blocked from being deployed to production. There are multiple ways to do this, such as Git hooks. Each organization should check what works best for its setting.
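As one possible shape for such a block, here is a small check that a Git hook or a pipeline step could run before deploying. The branch patterns are assumptions for illustration; pick the ones that match your own branching model.

```python
import fnmatch

# Hypothetical allowlist: only these branch patterns may reach production.
DEPLOYABLE_BRANCHES = ["main", "release/*"]


def is_deployable(branch: str) -> bool:
    """Return True if the branch name matches one of the allowed patterns."""
    return any(
        fnmatch.fnmatchcase(branch, pattern) for pattern in DEPLOYABLE_BRANCHES
    )
```

With this list, is_deployable("release/1.2") passes while is_deployable("feature/login") is rejected, so a feature branch can never be shipped by mistake.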
Too Many User Scenarios
Following our last two points, let’s assume that we have progressed sequentially from development to production and made sure that our features work. However, manually checking all the possible user journeys is practically impossible in some scenarios. By “user journey,” we mean the different permutations of GUI screens the user can encounter when using an application. For instance, an application can have different login screens based on the user type (such as corporate vs. private), different available features per the user’s billing plan, different color schemes per the user’s personal preferences, etc. Most people can’t manually test such an overwhelming number of test cases. Moreover, even if it’s possible, it’s probably not cost-effective. In this scenario, an automated tool like Testim is needed to automatically check the different user journeys and make sure they all work as expected.
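To get a feel for how quickly the journeys multiply, here is a tiny sketch that enumerates the combinations for the three dimensions mentioned above. The specific billing plans and color schemes are invented for the example.

```python
from itertools import product

USER_TYPES = ["corporate", "private"]
BILLING_PLANS = ["free", "pro"]      # hypothetical plans
COLOR_SCHEMES = ["light", "dark"]    # hypothetical schemes

# Every combination of user type, plan, and color scheme is its own journey.
journeys = list(product(USER_TYPES, BILLING_PLANS, COLOR_SCHEMES))

for user_type, plan, scheme in journeys:
    print(f"{user_type} user on the {plan} plan with the {scheme} scheme")
```

Even with just two options per dimension, that is already 2 × 2 × 2 = 8 journeys, and every extra dimension or option multiplies the total again.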
Not Deploying at All
It sounds like an oxymoron, but it’s not. It’s possible that the DevOps engineer pressed the deploy button and got a notification that everything is OK, but actually, nothing happened. The engineer didn’t deploy any code. This can occur due to a bug in the deployment pipeline. To mitigate this risk, the engineer must ensure that they have a rock-solid production deployment pipeline. Of course, you could perform manual QA on the production environment and ensure that everything is there. Manual QA is a somewhat trivial solution, but it works.
Failing to deploy your code can be costly. Let’s assume that you need to deploy a quick fix for a bug or that you need to deploy a new feature because a marketing campaign will soon launch. But for some reason, the deployment procedure is broken, and nothing happens. The whole situation culminates with angry marketing people yelling that a million-dollar marketing campaign is currently running on Facebook or Google Ads, all in vain because no one can reach the site. I have witnessed such situations firsthand, and the sight isn’t pretty. So although “not deploying at all” might not sound like a deployment risk, it can be one of the most serious ones.
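One way to catch a deployment that silently did nothing is to ask the live site which version it is running and compare that with the version you just shipped. The sketch below assumes your application can report its version somehow (for example, through a hypothetical /version endpoint); the fetching itself is passed in as a function so the check stays independent of how you retrieve it.

```python
from typing import Callable


def deployment_took_effect(
    expected_version: str, fetch_live_version: Callable[[], str]
) -> bool:
    """Compare the version we just shipped with what production reports.

    fetch_live_version is a stand-in for however your application exposes
    its running version, such as an HTTP call to a version endpoint.
    """
    return fetch_live_version().strip() == expected_version
```

Run as the last step of the pipeline, a check like this turns "the pipeline said OK" into "production is actually running the new code."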
Crashing the Current Code Before the New One Is Up
The current code can crash before the new code is active if you take the current environment down too soon. In this scenario, the current code’s production environment can go down for a few minutes or more. If this happens before the new code is up, no one can access the site. To avoid this, we can utilize blue/green deployments or use an orchestrator like Kubernetes to do the “switch.”
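The core idea of a blue/green switch is that traffic moves to the new environment only after it proves healthy, and the old one keeps serving until then. Here is a toy model of that rule. The BlueGreenRouter class is purely illustrative; in practice the switch is done by a load balancer or an orchestrator, not by application code like this.

```python
from typing import Callable


class BlueGreenRouter:
    """Toy model of a router that points at one of two environments."""

    def __init__(self, live: str = "blue") -> None:
        self.live = live

    @property
    def idle(self) -> str:
        """The environment that is not currently receiving traffic."""
        return "green" if self.live == "blue" else "blue"

    def switch(self, is_healthy: Callable[[str], bool]) -> bool:
        """Move traffic to the idle environment only if it passes the health check."""
        candidate = self.idle
        if not is_healthy(candidate):
            # The new code is not up yet, so the current version keeps serving.
            return False
        self.live = candidate
        return True
```

Because the old environment is never taken down before the new one passes its health check, there is no window in which the site is unreachable.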
Leaving the Previous Version Online
If you don’t take down the previous version after deploying your new code to production, you have two production environments: one with the new code and one with the old code. Sometimes the old environment is given a different URL (such as old.site.com), and sometimes it has the same URL. The latter scenario guarantees that some customers will use the new version while others will use the old one. To avoid this situation, we need to take down the old version, either manually or via an automatic mechanism.
As you can see, the deployment process is inherently risky, and many things can go wrong. However, your organization can’t stand still. You have to deploy new code to fix bugs and add new features to keep your customers happy. Following the guidelines above will help you avoid the most common risks of deployment.