One Decision That Cut Our Deploy Risk in Half
The Bet Nobody Wanted to Make
Most teams treat deploys like controlled explosions. You light the fuse, watch the dashboard, and hope the blast radius stays small. When something goes wrong, you scramble. The scramble is the plan.
We did this for a long time. And for a long time, our incident rate reflected it.
Then we made one structural decision that changed the trajectory: every deploy must be independently reversible. Not "rollback the whole release." Not "hotfix within thirty minutes." Each change that reaches production must be removable on its own, without dragging other changes with it.
It sounds simple. It was the hardest operational argument we had all year.
What "Independently Reversible" Actually Means
A normal rollback takes the entire release and rewinds it. If your release contains five changes and one of them breaks something, you lose all five. The four good changes go back to staging. The team re-tests, re-queues, and re-deploys — sometimes days later.
Independent reversibility means each change ships in a way that lets you pull it back without touching anything else that went out the same day. Think of it like removing one brick from a wall without the row above it collapsing.
The constraint forces specific behavior upstream:
- Changes cannot share mutable state transitions with other changes in the same window.
- Data-shape changes go out before the code that depends on them, with enough space between to verify each step.
- Feature exposure is separated from feature existence. Something can be deployed but not yet active.
None of this is novel. What made it hard was making it mandatory.
The Resistance Was Predictable
Engineers pushed back immediately, and honestly. Their concerns were legitimate:
"This slows us down." It does add steps. Splitting a release that used to be one artifact into two or three sequenced deploys takes planning. Early on, the overhead felt steep — roughly 20 to 30 percent more time between "code complete" and "live in production."
"We already test well enough." Testing catches a category of problems. It does not catch the interaction between your change and a customer's specific usage pattern at a specific scale at a specific time of day. Production is the final test environment. The question is how expensive a failed test is.
"Rollback already works." Whole-release rollback works when one thing breaks. It works poorly when two things ship and one of them is a database migration that shouldn't be reversed. At that point, rollback becomes a negotiation, not a button press.
We heard the objections, acknowledged the cost, and held the line.
Two Quarters Later
We measured customer-facing incidents — any event where a customer's API call returned an unexpected error or degraded response traced to a deploy.
Over the two quarters before the policy: 14 incidents.
Over the two quarters after: 6.
The incidents that did happen resolved faster. When you can remove exactly the change that caused the problem, mean time to recovery drops because diagnosis is half-finished before you start. You already know the boundary.
The velocity concern faded. Teams internalized the sequencing habit. After the first month, overhead shrank because people designed for reversibility from the start rather than retrofitting it at deploy time.
Why This Worked When Other Things Didn't
We had tried other approaches to reducing deploy risk. Better test coverage. Staged rollouts to subsets of traffic. Post-deploy checklists. Each helped marginally. None bent the curve.
The difference with independent reversibility: it changes the decision boundary, not the execution quality. You're not asking people to write better code or catch more bugs. You're restructuring the unit of risk so that when something goes wrong — and it will — the blast radius is one change, not one release.
Better code is a gradient. Better boundaries are a step function.
The Broader Lesson
Reliability improvements that depend on people being more careful don't compound. People have bad days. Attention drifts. Complexity grows.
Reliability improvements that change the structure of decisions do compound, because they work even on bad days. The rule "each change must be independently reversible" doesn't require vigilance. It requires compliance with a constraint that becomes habit.
If you're staring at an incident graph that won't bend, stop asking whether your team can execute better. Ask whether the shape of your releases is forcing unnecessary risk. The answer is usually yes.
The best decision we made last year wasn't a technology choice. It was a deploy policy that nobody wanted and everybody now defends.
0 comments
Be the first to comment.