Every team has the same incident in its postmortem archive. The Friday push that cascaded across three services. The release that landed while two P1s were already burning. The config tweak nobody realized was wired into fourteen microservices through a shared dependency.
None of those failures were code failures. They were context failures. CI was green. The diff looked surgical. The system around the deploy was the part that broke.
The pre-deploy risk score closes that gap. The agent fires the moment a deploy enters the queue, pulls signals across infrastructure and team state, and returns one of three verdicts — HOLD, PROCEED, or WATCH — with a numerical confidence and a plain-English breakdown of which signals moved the needle. Gut feel becomes a measurement. Tribal knowledge becomes a check the pipeline enforces.
Green CI Is Not the Same Thing as a Safe Deploy
Mature systems rarely ship broken code. They ship correct code into a state that cannot absorb it.
Most deployment failures in mature systems are not bad-code failures. They are context collisions — a technically valid change meeting an environment that was not ready for it. A migration that runs cleanly in staging, then deadlocks against an active A/B test that doubled write traffic on the affected table. A flag rollout that touches the same API surface as the incident remediation an on-call engineer started forty minutes earlier.
A 2025 Overmind analysis found that a substantial share of deploy-related incidents involved infrastructure dependencies the deploying engineer did not know existed[1] — with estimates that more than half of contextual failures are predictable given better dependency visibility.[4] The artifact was sound. The conditions around it were not. The pre-deploy risk score scores the deployment context, not the artifact.
Six Signals. Each One Measures a Different Failure Class.
Every signal is independently sourced and independently weighted. Drop any one and the verdict moves.
- [01] Active A/B Experiments
The agent queries the experimentation platform — LaunchDarkly, Split, Statsig — for every running experiment that touches the same services or routes the deploy modifies. Overlap is the failure mode. Deploy-induced variance corrupts experiment results, and experiment-induced traffic skew amplifies whatever side effects the deploy carries. Both directions burn weeks of analysis and a quarter of the experimentation roadmap.
- [02] On-Call Engineer Incident Load
A deploy landing on top of an on-call engineer already managing two open incidents means slower response when something else breaks. The agent reads PagerDuty or Opsgenie for the current rotation and pulls open and recently resolved incident counts over the last 48 hours. High fatigue forces a WATCH or HOLD. Responder bandwidth is a deployment signal, not a soft factor.
- [03] Blast Radius from the Dependency Graph
The heaviest signal. The agent parses Terraform state, CloudFormation stacks, or Kubernetes manifests into a runtime dependency graph, then traces which downstream services, databases, and queues the modified resources actually reach. A change to a shared VPC security group has nothing in common with a change to one Lambda, and the score has to reflect that.[2]
- [04] Seven-Day Incident Rate
If the target service has been unstable in the past week, layering more change onto it compounds the instability. The agent pulls incident history from the ITSM tool, weights by severity, then applies a recency decay. A service with two P2s in the last 72 hours scores radically differently from one that has been quiet for a quarter.
- [05] Outstanding P2 Bugs
Open high-priority bugs are evidence that the codebase is already carrying instability nobody has paid down. The agent queries Jira or Linear for open P2s tagged against the affected services. Each unresolved bug is a place where additional change is more likely to reveal something nobody scoped.
- [06] Time-to-Friday and Calendar Pressure
Deploying at 4:47 PM on a Friday before a holiday weekend is structurally different from deploying Tuesday at 10 AM. Not because Friday is cursed, but because the responder pool shrinks, the on-call engineer is one person, and any incident that does land has to survive the weekend on whoever happens to be carrying the pager. The agent calculates hours to end-of-business Friday, checks the company holiday calendar, and factors in the geographic spread of the on-call team. Staffing reality, not superstition.
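Signal 04's "weights by severity, then applies a recency decay" can be sketched roughly as follows. The severity weights, the 48-hour half-life, and the clamp onto the 0-10 scale are illustrative assumptions rather than tuned values, and the type and function names are hypothetical.

```typescript
// Sketch of signal 04: severity-weighted incident load with exponential recency decay.
// SEVERITY_WEIGHT and HALF_LIFE_HOURS are illustrative assumptions, not tuned values.
type Severity = "P1" | "P2" | "P3";

interface Incident {
  severity: Severity;
  openedAt: Date;
}

const SEVERITY_WEIGHT: Record<Severity, number> = { P1: 5, P2: 2, P3: 1 };
const HALF_LIFE_HOURS = 48; // an incident counts half as much every 48 hours
const WINDOW_HOURS = 7 * 24; // seven-day lookback, as in the signal description
const MS_PER_HOUR = 3600000;

function incidentLoad(incidents: Incident[], now: Date): number {
  let load = 0;
  for (const incident of incidents) {
    const ageHours = (now.getTime() - incident.openedAt.getTime()) / MS_PER_HOUR;
    if (ageHours < 0 || ageHours > WINDOW_HOURS) continue; // outside the window
    const decay = Math.pow(0.5, ageHours / HALF_LIFE_HOURS);
    load += SEVERITY_WEIGHT[incident.severity] * decay;
  }
  return load;
}

// Clamp the raw load onto the 0-10 sub-score scale (placeholder mapping).
function incidentSubScore(incidents: Incident[], now: Date): number {
  return Math.min(10, Math.round(incidentLoad(incidents, now) * 10) / 10);
}
```

With this shape, a fresh P2 contributes 2 points, the same P2 two days later contributes 1, and anything older than seven days drops out entirely.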
Building the Blast Radius Estimator From IaC State
Pull the dependency graph from Terraform and CloudFormation. Score what the change actually reaches.
Blast radius is the heaviest signal and the most technically involved component of the score. It earns its own treatment. The mechanism is straightforward: parse Infrastructure-as-Code state into a directed acyclic graph of resource dependencies, then count how many nodes are reachable from the set of resources the deploy modifies.
For Terraform, the entry point is terraform graph or parsing the state file directly. The open-source blast-radius project pioneered interactive visualization of these dependency graphs in d3.js.[2] For production scoring you do not need the visualization — pipe terraform graph -type=plan into a parser that extracts nodes and edges, then run a breadth-first traversal from the changed resources outward.[3]
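A minimal sketch of that parser, assuming the graph output arrives as DOT text with edge lines shaped like `"a" -> "b"`, where the edge means "a depends on b". The function name `parseDotDependents` is hypothetical, and normalizing node labels (some Terraform versions decorate them) is left out.

```typescript
// Sketch: turn `terraform graph` DOT output into a reverse-dependency map.
// Assumes edge lines shaped like `"a" -> "b"`, read as "a depends on b".
function parseDotDependents(dot: string): Map<string, string[]> {
  const dependents = new Map<string, string[]>();
  const edgePattern = /"([^"]+)"\s*->\s*"([^"]+)"/;

  for (const line of dot.split("\n")) {
    const match = edgePattern.exec(line);
    if (!match) continue; // node declarations, braces, attributes
    const from = match[1];
    const to = match[2];
    // Register both endpoints so leaf resources still appear as keys.
    if (!dependents.has(from)) dependents.set(from, []);
    if (!dependents.has(to)) dependents.set(to, []);
    // `from` depends on `to`, so a change to `to` reaches `from`.
    const list = dependents.get(to)!;
    if (!list.includes(from)) list.push(from);
  }
  return dependents;
}
```

The edges are reversed on purpose: blast radius asks "what breaks if this changes", so the traversal needs dependents rather than dependencies. Each entry in the returned map can supply the direct dependents for one node in the scorer's graph.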
CloudFormation behaves differently. Stacks expose DependsOn relationships explicitly, and the AWS API returns the stack's resource list on demand. The trap is cross-stack references: when one stack exports a value another imports, the dependency is implicit but the blast radius is real. The parser has to follow Fn::ImportValue references across stack boundaries, or the score under-reads coupling that production absolutely respects.[5]
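One way to follow those cross-stack references is sketched below. It works on already-parsed JSON templates and a precomputed export-name-to-stack map, so fetching either from AWS is out of scope; `StackTemplate`, `collectImports`, and `crossStackDependents` are hypothetical names. It only handles `Fn::ImportValue` with a literal string argument, and imports built from nested intrinsics are skipped.

```typescript
// Sketch: derive cross-stack edges from Fn::ImportValue references.
// Inputs are plain parsed objects; nothing here calls AWS.
interface StackTemplate {
  stackName: string;
  template: unknown; // parsed JSON template body
}

// Recursively collect every literal string passed to Fn::ImportValue.
function collectImports(node: unknown, found: Set<string> = new Set<string>()): Set<string> {
  if (Array.isArray(node)) {
    node.forEach((item) => collectImports(item, found));
  } else if (node !== null && typeof node === "object") {
    const record = node as Record<string, unknown>;
    for (const key of Object.keys(record)) {
      const value = record[key];
      if (key === "Fn::ImportValue" && typeof value === "string") {
        found.add(value);
      } else {
        collectImports(value, found);
      }
    }
  }
  return found;
}

// exportOwners maps an export name to the stack that exports it.
function crossStackDependents(
  stacks: StackTemplate[],
  exportOwners: Map<string, string>
): Map<string, string[]> {
  const dependents = new Map<string, string[]>();
  for (const stack of stacks) {
    collectImports(stack.template).forEach((exportName) => {
      const owner = exportOwners.get(exportName);
      if (!owner || owner === stack.stackName) return; // unknown export or self-reference
      const list = dependents.get(owner) ?? [];
      if (!list.includes(stack.stackName)) list.push(stack.stackName);
      dependents.set(owner, list);
    });
  }
  return dependents;
}
```

The result maps each exporting stack to the stacks that import from it, which is the direction a blast radius traversal needs.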
The scoring formula weights direct dependents more heavily than transitive ones, with a decay factor at each hop. A change that hits 3 services directly and reaches 12 more transitively scores differently from one that hits 12 services directly with no transitive reach. The node counts are close (15 versus 12), but the blast radius is very different.
blast-radius-scorer.ts

interface DependencyNode {
  resourceId: string;
  resourceType: string;
  directDependents: string[];
  transitiveDependents: string[];
}

function calculateBlastRadius(
  changedResources: string[],
  graph: Map<string, DependencyNode>
): { score: number; affectedServices: string[]; depth: number } {
  const visited = new Set<string>();
  const queue: { id: string; depth: number }[] = [];
  let maxDepth = 0;

  // Seed the BFS with the resources the deploy actually touches
  for (const id of changedResources) {
    queue.push({ id, depth: 0 });
    visited.add(id);
  }

  // Walk outward through the dependency graph
  let score = 0;
  while (queue.length > 0) {
    const { id, depth } = queue.shift()!;
    const node = graph.get(id);
    if (!node) continue; // resource missing from the graph: nothing to score

    // Decay: changed resources weigh 1.0; each hop outward multiplies by 0.6
    const depthWeight = Math.pow(0.6, depth);
    score += depthWeight;
    maxDepth = Math.max(maxDepth, depth);

    for (const dep of node.directDependents) {
      if (!visited.has(dep)) {
        visited.add(dep);
        queue.push({ id: dep, depth: depth + 1 });
      }
    }
  }

  return {
    score: Math.round(score * 100) / 100,
    affectedServices: [...visited], // includes the changed resources themselves
    depth: maxDepth,
  };
}

Six Sub-Scores, One Verdict, Zero Ambiguity
How the weighted sum collapses to HOLD, WATCH, or PROCEED — and where the thresholds bite.
| Signal | Weight | Low (0-3) | Medium (4-6) | High (7-10) |
|---|---|---|---|---|
| Blast Radius | 0.25 | ≤2 direct deps | 3-8 direct deps | >8 direct deps |
| 7-Day Incident Rate | 0.20 | 0-1 incidents | 2-3 incidents | 4+ or any P1 |
| On-Call Load | 0.15 | 0 open incidents | 1-2 open incidents | 3+ or recent P1 |
| A/B Experiments | 0.15 | 0 overlapping | 1-2 overlapping | 3+ overlapping |
| Outstanding P2 Bugs | 0.10 | 0-1 open | 2-4 open | 5+ open |
| Time-to-Friday | 0.15 | >24 hours | 8-24 hours | <8 hours |
The final score is a weighted sum of normalized sub-scores between 0 and 10. The verdict mapping below is a starting point — calibrate the thresholds against your own incident history after 60-90 days of outcome data:
- PROCEED (score 0.0 – 3.9, confidence ≥ 70%): All signals inside acceptable ranges. The deploy goes with standard monitoring.
- WATCH (score 4.0 – 6.4, or confidence 50–69%): Elevated risk on at least one axis. The deploy goes, but the agent shortens canary windows, tightens rollback thresholds, and posts a Slack alert into the on-call channel.
- HOLD (score 6.5 – 10.0, or any single signal at 9+): The agent halts the pipeline and pages the deploy author with the list of signals that fired. A manual override requires two approvals, both logged.
Confidence reflects data completeness. If the experimentation API timed out or the incident tracker rate-limited, confidence drops — and a low-confidence PROCEED is auto-promoted to WATCH on uncertainty alone. Treat every threshold as an initial heuristic. Calibrate against your false-positive and false-negative rates.
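The weighted sum and verdict mapping above can be sketched as follows. The weights and thresholds are the starting values from the table and list; the type and function names are hypothetical. One assumption goes beyond the text: confidence below 50% is treated the same as the 50-69% band, since only that band is defined explicitly.

```typescript
// Sketch of the weighted sum and verdict mapping described above.
// Weights and thresholds are the starting values from the text, not calibrated ones.
type Signal =
  | "blastRadius"
  | "incidentRate"
  | "onCallLoad"
  | "experiments"
  | "p2Bugs"
  | "timeToFriday";

type Verdict = "PROCEED" | "WATCH" | "HOLD";

const WEIGHTS: Record<Signal, number> = {
  blastRadius: 0.25,
  incidentRate: 0.2,
  onCallLoad: 0.15,
  experiments: 0.15,
  p2Bugs: 0.1,
  timeToFriday: 0.15,
};

// Each sub-score is expected on the 0-10 scale.
function weightedScore(subScores: Record<Signal, number>): number {
  let total = 0;
  for (const signal of Object.keys(WEIGHTS) as Signal[]) {
    total += WEIGHTS[signal] * subScores[signal];
  }
  return Math.round(total * 10) / 10; // one decimal, matching the verdict bands
}

function verdictFor(subScores: Record<Signal, number>, confidencePct: number): Verdict {
  const score = weightedScore(subScores);
  const signals = Object.keys(WEIGHTS) as Signal[];
  const maxSignal = Math.max(...signals.map((s) => subScores[s]));
  if (score >= 6.5 || maxSignal >= 9) return "HOLD";
  // Assumption: any confidence under 70% (including under 50%) promotes to WATCH.
  if (score >= 4.0 || confidencePct < 70) return "WATCH";
  return "PROCEED";
}
```

Note the single-signal rule: a blast radius sub-score of 9 forces a HOLD even when every other signal is quiet and the weighted sum alone would land in PROCEED territory.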
These weights are starting points. Calibration is non-optional.
The signal weights and verdict thresholds above are illustrative defaults drawn from common deploy failure patterns. Your infrastructure topology, team size, deploy cadence, and incident distribution will need different calibration. Teams that skip the 60-90 day calibration phase tend to run with blast radius underweighted (calibrated teams usually push it from 0.25 toward 0.30-0.35) and with time-to-Friday overweighted when their incidents do not actually cluster around Fridays. Run advisory-only first. Trust the data, not the defaults.
Without the risk score:
- Engineers guess at deploy timing from vibes
- Blast radius is unknown until something breaks
- On-call fatigue is invisible to the person shipping
- Friday deploys ride on social pressure, not data
- Experiment contamination surfaces weeks later in skewed results
- Incident-during-deploy response is reactive scrambling

With the risk score:
- Every deploy gets an objective score from the same six signals
- Blast radius is quantified from the IaC dependency graph
- On-call state is a first-class deploy input, not a manual check
- Time-based risk is calculated, not negotiated
- Experiment overlap is flagged before the code reaches production
- High-risk windows are identified and held before the pager goes off
Wiring the Score Into the Pipeline That Already Ships
GitHub Actions, ArgoCD, custom pipelines — the integration is the same shape every time.
.github/workflows/deploy-gate.yml

# Pre-deploy risk gate. Runs before any production deploy reaches the cluster.
name: Pre-Deploy Risk Gate
on:
  deployment:
    types: [created]
jobs:
  risk-score:
    runs-on: ubuntu-latest
    outputs:
      verdict: ${{ steps.score.outputs.verdict }}
      confidence: ${{ steps.score.outputs.confidence }}
    steps:
      - uses: actions/checkout@v4
      - name: Gather deploy context
        id: context
        # The gh CLI reads GH_TOKEN; PR is assumed to be set elsewhere in the job environment.
        env:
          GH_TOKEN: ${{ github.token }}
          REPO: ${{ github.repository }}
        run: |
          # Join the changed paths with commas so the value stays on one output line
          echo "changed_services=$(gh api "repos/$REPO/pulls/$PR/files" | jq -r '.[].filename' | sort -u | paste -sd, -)" >> "$GITHUB_OUTPUT"
      - name: Score the deploy
        id: score
        run: |
          npx deploy-risk-agent \
            --services "${{ steps.context.outputs.changed_services }}" \
            --pagerduty-token "${{ secrets.PD_TOKEN }}" \
            --launchdarkly-token "${{ secrets.LD_TOKEN }}" \
            --terraform-state "s3://infra-state/prod" \
            --jira-project "ENG" \
            --output json > risk-report.json
          echo "verdict=$(jq -r .verdict risk-report.json)" >> "$GITHUB_OUTPUT"
          echo "confidence=$(jq -r .confidence risk-report.json)" >> "$GITHUB_OUTPUT"
      - name: Enforce verdict
        if: steps.score.outputs.verdict == 'HOLD'
        run: |
          echo "::error::Deploy HELD: risk score exceeded the threshold"
          exit 1

Calibration Is the Job. The Initial Score Is the Hypothesis.
A risk scoring system is only as good as its calibration. After every deploy — PROCEED, WATCH, or overridden HOLD — log the outcome. Did the deploy cause an incident inside 24 hours? Was a rollback required? Did any experiment results get invalidated by the change?
Store those outcomes alongside the original scores in a tracking table. After 60-90 days of outcome data, fit a logistic regression of incident outcome on the six sub-scores and compare the fitted coefficients with the current weights to see what is actually predictive. Findings that show up over and over from teams that have done this:
- Blast radius is almost always underweighted at the start. Most teams end up moving it from 0.25 to 0.30-0.35.[4]
- Time-to-Friday is overweighted for teams whose incidents cluster on Mondays or mid-week. Tune the weight against your actual day-of-week incident distribution, not the cultural assumption.
- On-call load gets sharper when the input includes the responder's recent activity timestamps — a proxy for sleep — and not just open incident count.
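Before any regression, the outcome log described above already supports a simpler check: how often the gate flagged deploys that went fine, and how often it waved through deploys that did not. A minimal sketch follows, with hypothetical field names. The rate definitions are an assumption: a deploy counts as "flagged" on WATCH or HOLD, and as "bad" if it caused an incident within 24 hours or was rolled back.

```typescript
// Sketch: outcome log rows and the false-positive / false-negative rates derived from them.
// Field names and rate definitions are assumptions for illustration.
interface DeployOutcome {
  deployId: string;
  verdict: "PROCEED" | "WATCH" | "HOLD"; // HOLD rows are overridden holds
  score: number;
  incidentWithin24h: boolean;
  rolledBack: boolean;
}

interface CalibrationRates {
  falsePositiveRate: number; // flagged share of the deploys that went fine
  falseNegativeRate: number; // unflagged share of the deploys that went badly
}

function calibrationRates(outcomes: DeployOutcome[]): CalibrationRates {
  let cleanTotal = 0;
  let cleanFlagged = 0;
  let badTotal = 0;
  let badUnflagged = 0;

  for (const outcome of outcomes) {
    const flagged = outcome.verdict !== "PROCEED";
    const bad = outcome.incidentWithin24h || outcome.rolledBack;
    if (bad) {
      badTotal++;
      if (!flagged) badUnflagged++;
    } else {
      cleanTotal++;
      if (flagged) cleanFlagged++;
    }
  }

  return {
    falsePositiveRate: cleanTotal === 0 ? 0 : cleanFlagged / cleanTotal,
    falseNegativeRate: badTotal === 0 ? 0 : badUnflagged / badTotal,
  };
}
```

Tracking both rates per calibration window shows whether a weight change traded one kind of error for the other.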
Here is the counterintuitive part. The teams that get the most out of the score are not the ones deploying dozens of times a day. They are the ones deploying monthly or quarterly. High-velocity teams build deploy intuition organically — every push is feedback. Low-velocity teams have no feedback loop, so every deploy is a high-stakes event with no pattern recognition behind it. The risk score replaces the intuition the cadence never built.
Pre-Deploy Risk Score Implementation Checklist
- Webhook listener wired to the deploy queue event (not polled, not batched)
- Experimentation platform API integrated (LaunchDarkly, Split, or Statsig)
- PagerDuty or Opsgenie connected for live on-call state
- Terraform or CloudFormation dependency graph parser landed and tested against staging
- Incident history query against the ITSM tool (severity weighted, recency decayed)
- Issue tracker connected for live P2 bug counts on affected services
- Time-to-Friday calculator with company holiday awareness, primary timezone configured
- Initial signal weights and verdict thresholds defined as a code-reviewed config file, not hard-coded constants
- CI/CD gate landed with a documented two-person override path for HOLDs
- Outcome tracking table created, with every verdict joined to the next 24 hours of incidents and rollbacks
- Monthly weight recalibration review on the team calendar with a named owner
What if one of the signal sources is unavailable when scoring runs?
Degrade gracefully. A timed-out or errored signal source gets a score of 5 (neutral-high) and the overall confidence drops proportionally. A low-confidence PROCEED auto-promotes to WATCH. Never let a missing signal collapse to an unconditional PROCEED — that turns a degraded system into a green light, which is exactly the failure mode the score exists to prevent.
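A rough sketch of that fallback, with hypothetical names. The answer says confidence "drops proportionally" without fixing a formula, so this version assumes confidence is the share of total signal weight that was backed by live data.

```typescript
// Sketch: fill unavailable signals with a neutral 5 and derive confidence.
// Assumption: confidence is the share of total weight backed by live data.
interface SignalReading {
  name: string;
  weight: number;
  score: number | null; // null when the source timed out or errored
}

const FALLBACK_SCORE = 5;

function scoreWithConfidence(
  readings: SignalReading[]
): { score: number; confidencePct: number } {
  let weighted = 0;
  let liveWeight = 0;
  let totalWeight = 0;

  for (const reading of readings) {
    totalWeight += reading.weight;
    if (reading.score === null) {
      weighted += reading.weight * FALLBACK_SCORE; // missing source: neutral fallback
    } else {
      weighted += reading.weight * reading.score;
      liveWeight += reading.weight;
    }
  }

  // No signals at all: neutral score, zero confidence, never a clean PROCEED.
  if (totalWeight === 0) return { score: FALLBACK_SCORE, confidencePct: 0 };

  return {
    score: Math.round((weighted / totalWeight) * 10) / 10,
    confidencePct: Math.round((liveWeight / totalWeight) * 100),
  };
}
```

Under this assumption, losing the heaviest signal costs the most confidence, which matches the intent: the less the score actually knows, the less it is allowed to say PROCEED.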
How do you handle monorepo deploys where everything looks like it changes at once?
Scope the blast radius analysis to the build targets that actually changed, not the entire repository. Bazel, Nx, and Turborepo expose affected-project graphs that map directly into dependency analysis — Nx's affected command, for one, returns the precise set of projects the commit range touches. The score should evaluate per-deployable-unit, not per-commit. One team initially scored their entire monorepo on every PR and produced HOLDs on documentation typos. Scoping to affected targets dropped false positives from roughly 40% to under 8%.
Should the score block deploys automatically or just advise?
Advisory-only for the first 30 days. Let the team see the verdicts, argue with them, and build calibration intuition against their own deploys. Once the false-positive rate drops below 10%, switch HOLDs to blocking with two-person override. WATCH stays advisory. Going to enforcement before the false-positive rate is under control trains engineers to override on reflex, which is worse than not having the gate at all.
How does the score interact with feature flags and progressive delivery?
Feature flags shrink blast radius by limiting exposure, and the score should reflect that. If a deploy is shipping behind a flag at 1% rollout, the effective blast radius is 1% of the calculated value. The agent should query the feature flag platform to apply the multiplier. The structural point: progressive delivery is itself a deploy safety mechanism, and the score should compose with it instead of double-counting risk.
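The multiplier amounts to one line of arithmetic, sketched here with a hypothetical function name. It assumes the flag platform reports rollout as a percentage from 0 to 100 and that a deploy with no flag is fully exposed.

```typescript
// Sketch: scale a computed blast radius by the flag's rollout percentage.
// Assumes rollout is reported as 0-100; null means the change is not behind a flag.
function effectiveBlastRadius(rawScore: number, rolloutPct: number | null): number {
  const pct = rolloutPct === null ? 100 : Math.min(100, Math.max(0, rolloutPct));
  // Round to three decimals so small rollouts do not collapse to zero.
  return Math.round(rawScore * pct * 10) / 1000;
}
```

A raw blast radius of 10 behind a 1% flag comes out at 0.1, while the same change with no flag keeps the full 10.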
The pre-deploy risk score is not a brake on velocity. Teams that ship it consistently report higher deploy frequency, because engineers deploy more often when they trust the system to flag bad timing. The score swaps anxiety for a measurement, and trades the post-incident "we should have known" for the pre-incident "the gate caught it."
Start with blast radius and incident rate. Those two signals catch most of the contextual deploy failures by themselves.[4] Add the rest as the API integrations land. Within 90 days of outcome calibration, the score is tuned to the specific shape of the infrastructure and the team that lives inside it.
The code was never the problem. The conditions were. Score them, or keep paying for them.
- [1] The Register, "Overmind: The Tool That Maps Your Infrastructure's Blast Radius Before You Break It" (theregister.com)
- [2] 28mm, "blast-radius: Interactive visualizations of Terraform dependency graphs" (github.com)
- [3] IBM, "Blast Radius: Review the Impact of Changes in Your Terraform Files" (ibm.com)
- [4] Overmind, "The Difference Between Terraform Plan and Overmind Blast Radius" (overmind.tech)
- [5] Firefly, "Terraform Module Blast Radius: Methods for Resilient IaC in Platform Engineering" (firefly.ai)