Most production deploys that break did not break because of bad code. They broke because of context the deployer could not see. A pre-deploy risk score replaces gut feel with six measurable signals and a HOLD/PROCEED/WATCH verdict the pipeline enforces.
Why green CI is not the same as a safe deploy — and what the gap actually is
Six signals that capture deployment context, not just artifact quality
The blast radius estimator: BFS traversal of the IaC dependency graph
A weighted scoring formula with worked thresholds — calibrate to your incident history
GitHub Actions gate and enforcement pattern, with a safe override path
Monorepo, progressive delivery, and signal-degradation edge cases
Every team has the same incident in its postmortem archive. The Friday push that cascaded across three services. The release that landed while two P1s were already burning. The config tweak nobody realized was wired into fourteen microservices through a shared dependency.
None of those failures were code failures. They were context failures. CI was green. The diff looked surgical. The system around the deploy was the part that broke.
The pre-deploy risk score closes that gap. The agent fires the moment a deploy enters the queue, pulls signals across infrastructure and team state, and returns one of three verdicts — HOLD, PROCEED, or WATCH — with a numerical confidence and a plain-English breakdown of which signals moved the needle. Gut feel becomes a measurement. Tribal knowledge becomes a check the pipeline enforces.
Mature systems rarely ship broken code. They ship correct code into a state that cannot absorb it.
Most deployment failures in mature systems are not bad-code failures. They are context collisions — a technically valid change meeting an environment that was not ready for it. A migration that runs cleanly in staging, then deadlocks against an active A/B test that doubled write traffic on the affected table. A flag rollout that touches the same API surface as the incident remediation an on-call engineer started forty minutes earlier.
The 2024 DORA State of DevOps Report found that the high-performance cluster shrank from 31% of respondents in 2023 to 22% in 2024, while the low cluster grew from 17% to 25%.[6] That regression happened during a period of increased AI-assisted coding — which improved throughput but increased delivery instability. Teams were shipping faster into worse conditions with no mechanism to read those conditions.
Overmind, which maps dependencies across 100+ AWS resource types and Kubernetes objects to calculate blast radius, raised $6M in 2025 specifically because the problem is not the artifact — it's the infrastructure state surrounding it.[1] The pre-deploy risk score scores the deployment context, not the artifact.
Every signal is independently sourced and independently weighted. Drop any one and the verdict moves.
The agent queries the experimentation platform — LaunchDarkly, Split, Statsig — for every running experiment that touches the same services or routes the deploy modifies. Overlap is the failure mode. Deploy-induced variance corrupts experiment results, and experiment-induced traffic skew amplifies whatever side effects the deploy carries. Both directions burn weeks of analysis and a quarter of the experimentation roadmap.
A deploy landing on top of an on-call already managing two open incidents means slower response when something else breaks. The agent reads PagerDuty or Opsgenie for the current rotation and pulls open and recently-resolved incident counts over the last 48 hours. High fatigue forces a WATCH or HOLD. Responder bandwidth is a deployment signal, not a soft factor. Research suggests teams receive over 2,000 alerts weekly, with only 3% needing immediate action, so on-call engineers are already context-saturated before your deploy lands.
The heaviest signal. The agent parses Terraform state, CloudFormation stacks, or Kubernetes manifests into a runtime dependency graph, then traces which downstream services, databases, and queues the modified resources actually reach. Overmind maps 100+ AWS resource types and Kubernetes objects in real time — well beyond what any team holds in their mental model.[1] A change to a shared VPC security group has nothing in common with a change to one Lambda, and the score has to reflect that.[2]
If the target service has been unstable in the past week, layering more change onto it compounds the instability. The agent pulls incident history from the ITSM tool, weights by severity, then applies a recency decay. A service with two P2s in the last 72 hours scores radically differently from one that has been quiet for a quarter.
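One way to picture "weight by severity, then apply a recency decay" is an exponential half-life. This is a minimal sketch, not a prescribed formula: the severity weights and the 72-hour half-life are hypothetical values chosen for illustration, and the output is a raw pressure figure that would still need mapping onto the 0-10 sub-score range.

```python
from datetime import datetime, timedelta

# Hypothetical values for illustration; tune both against your own incident data.
SEVERITY_WEIGHT = {"P1": 4.0, "P2": 2.0, "P3": 1.0}
HALF_LIFE_HOURS = 72.0

def incident_pressure(incidents, now):
    """Sum severity weights, halving each incident's contribution every HALF_LIFE_HOURS.

    incidents: iterable of (severity, opened_at) tuples.
    """
    total = 0.0
    for severity, opened_at in incidents:
        age_hours = (now - opened_at).total_seconds() / 3600.0
        total += SEVERITY_WEIGHT[severity] * 0.5 ** (age_hours / HALF_LIFE_HOURS)
    return total

now = datetime(2025, 1, 9, 15, 0)
recent = [("P2", now - timedelta(hours=24)), ("P2", now - timedelta(hours=60))]
quiet = [("P2", now - timedelta(days=90))]
print(incident_pressure(recent, now) > incident_pressure(quiet, now))  # True
```

Two recent P2s dominate a single P2 from a quarter ago, which is the behavior the signal is after.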
Open high-priority bugs are evidence that the codebase is already carrying instability nobody has paid down. The agent pulls Jira or Linear for open P2s tagged against the affected services. Each unresolved bug is a place where additional change is more likely to reveal something nobody scoped.
Deploying at 4:47 PM on a Friday before a holiday weekend is structurally different from deploying Tuesday at 10 AM. Not because Friday is cursed, but because the responder pool shrinks, the on-call engineer is one person, and any incident that does land has to survive the weekend on whoever happens to be carrying the pager. The agent calculates hours to end-of-business Friday, checks the company holiday calendar, and factors in the geographic spread of the on-call team. Staffing reality, not superstition.
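The hours-to-Friday calculation is simple enough to sketch. The version below assumes end of business is 17:00 local time and treats any moment after Friday's close, including the weekend, as zero hours remaining; both are assumptions, and the holiday calendar and on-call geography mentioned above are left out. The band cut-offs follow the Time-to-Friday row of the scoring table.

```python
from datetime import datetime, timedelta

def hours_to_friday_eob(now, eob_hour=17):
    """Hours from `now` until this week's Friday close; 0.0 once that close has passed."""
    days_ahead = 4 - now.weekday()  # Monday is 0, so Friday is 4; negative on weekends
    friday_eob = (now + timedelta(days=days_ahead)).replace(
        hour=eob_hour, minute=0, second=0, microsecond=0
    )
    return max(0.0, (friday_eob - now).total_seconds() / 3600.0)

def time_to_friday_band(hours):
    """Map remaining hours onto the low / medium / high bands of the scoring table."""
    if hours < 8:
        return "high"
    if hours <= 24:
        return "medium"
    return "low"

thursday_3pm = datetime(2025, 1, 9, 15, 0)
print(hours_to_friday_eob(thursday_3pm))                       # 26.0
print(time_to_friday_band(hours_to_friday_eob(thursday_3pm)))  # low
```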
Terraform plan shows what changes. It does not show what breaks. That distinction is everything.
Blast radius is the heaviest signal and the most technically involved component of the score. It earns its own treatment.
terraform plan tells you which resources will be modified, replaced, or destroyed. It does not tell you which downstream services depend on those resources at runtime. A security group rule can be legal to change and still knock over four services that ingress through it — services that don't appear anywhere in the plan output because Terraform only knows about the resources it manages directly.
The gap between "what will change" and "what will break" is exactly what blast radius analysis closes. Overmind's approach — querying AWS APIs in real time to build a complete dependency map rather than relying solely on IaC state — catches cross-stack references and runtime dependencies that static plan analysis misses entirely.[4]
For Terraform, the entry point is terraform graph or parsing the state file directly. The open-source blast-radius project pioneered interactive visualization of these dependency graphs in d3.js.[2] For production scoring you don't need the visualization — pipe terraform graph -type=plan into a parser that extracts nodes and edges, then run a breadth-first traversal from the changed resources outward.[3]
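As a rough sketch of the parsing step, the snippet below pulls edges out of DOT text and inverts them into a "who depends on me" map. It assumes the quoted `"A" -> "B"` edge format, where A depends on B, and strips the `[root]` prefix and `(expand)` suffix some Terraform versions emit. The function names are hypothetical, and the regex does not handle escaped quotes inside provider node labels, so a production parser would want a real DOT library.

```python
import re
from collections import defaultdict

EDGE = re.compile(r'"([^"]+)"\s*->\s*"([^"]+)"')

def clean(label):
    """Strip the '[root] ' prefix and a trailing ' (expand)'-style suffix if present."""
    label = re.sub(r"^\[root\]\s+", "", label)
    return re.sub(r"\s+\([^)]*\)$", "", label)

def dependents_from_dot(dot_text):
    """Map each resource to the set of resources that depend on it."""
    dependents = defaultdict(set)
    for src, dst in EDGE.findall(dot_text):
        # An edge 'src -> dst' is read as: src depends on dst.
        dependents[clean(dst)].add(clean(src))
    return dependents

SAMPLE = '''digraph {
  "[root] aws_instance.web (expand)" -> "[root] aws_security_group.shared (expand)"
  "[root] aws_lb.api (expand)" -> "[root] aws_security_group.shared (expand)"
}'''

graph = dependents_from_dot(SAMPLE)
print(sorted(graph["aws_security_group.shared"]))  # ['aws_instance.web', 'aws_lb.api']
```

The inverted map is the input a breadth-first traversal needs: start at the changed resource and walk outward through its dependents.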
CloudFormation behaves differently. Stacks expose DependsOn relationships explicitly, and the AWS API returns the stack's resource list on demand. The trap is cross-stack references: when one stack exports a value another imports, the dependency is implicit but the blast radius is real. The parser has to follow Fn::ImportValue references across stack boundaries, or the score under-reads coupling that production absolutely respects.[5]
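Following Fn::ImportValue across stack boundaries can be sketched as a recursive walk over each template. The example assumes templates are already loaded as Python dicts and that a map from export name to exporting stack has been fetched separately; the function names and the sample stack names are hypothetical. It only catches literal export names, not names assembled with Fn::Sub or Fn::Join.

```python
def import_values(node):
    """Recursively collect literal export names referenced via Fn::ImportValue."""
    found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Fn::ImportValue" and isinstance(value, str):
                found.add(value)
            else:
                found |= import_values(value)
    elif isinstance(node, list):
        for item in node:
            found |= import_values(item)
    return found

def cross_stack_edges(templates, exports):
    """Return (importing_stack, exporting_stack) pairs.

    templates: {stack_name: template_dict}
    exports:   {export_name: exporting_stack_name}
    """
    edges = set()
    for stack, template in templates.items():
        for name in import_values(template):
            if name in exports:
                edges.add((stack, exports[name]))
    return edges

payments = {"Resources": {"Svc": {"Properties": {
    "SecurityGroups": [{"Fn::ImportValue": "shared-sg-id"}]}}}}
print(cross_stack_edges({"payments": payments}, {"shared-sg-id": "network"}))
# {('payments', 'network')}
```

Each returned pair is an implicit dependency edge that belongs in the same graph as the explicit DependsOn relationships.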
The scoring formula weights direct dependents more heavily than transitive ones, with a decay factor at each hop. A change that hits 3 services directly and reaches 9 more transitively scores differently from one that hits 12 services directly with no transitive reach. Same node count, very different blast radius.
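The traversal with per-hop decay can be sketched in a few lines. The decay factor of 0.5 is a hypothetical default, not a recommended value, and the raw total would still need normalizing onto the 0-10 sub-score range. The input is a map from each resource to the resources that depend on it.

```python
from collections import deque

def blast_radius(dependents, changed, decay=0.5):
    """Breadth-first walk outward from the changed resources.

    Direct dependents count 1.0 each; every further hop is multiplied by `decay`.
    """
    seen = set(changed)
    queue = deque((node, 0) for node in changed)
    score = 0.0
    while queue:
        node, hop = queue.popleft()
        for dep in dependents.get(node, ()):
            if dep in seen:
                continue  # count each reachable resource once, at its shortest distance
            seen.add(dep)
            score += decay ** hop
            queue.append((dep, hop + 1))
    return score

# Six reachable nodes in both graphs, arranged differently.
layered = {"sg": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
flat = {"sg": ["s1", "s2", "s3", "s4", "s5", "s6"]}
print(blast_radius(layered, ["sg"]))  # 4.0
print(blast_radius(flat, ["sg"]))     # 6.0
```

Both graphs reach six nodes, but the flat one scores higher because every dependent sits one hop from the change.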
How the weighted sum collapses to HOLD, WATCH, or PROCEED — and where the thresholds bite.
| Signal | Weight | Low (0-3) | Medium (4-6) | High (7-10) |
|---|---|---|---|---|
| Blast Radius | 0.25 | ≤2 direct deps | 3-8 direct deps | >8 direct deps |
| 7-Day Incident Rate | 0.20 | 0-1 incidents | 2-3 incidents | 4+ or any P1 |
| On-Call Load | 0.15 | 0 open incidents | 1-2 open incidents | 3+ or recent P1 |
| A/B Experiments | 0.15 | 0 overlapping | 1-2 overlapping | 3+ overlapping |
| Outstanding P2 Bugs | 0.10 | 0-1 open | 2-4 open | 5+ open |
| Time-to-Friday | 0.15 | >24 hours | 8-24 hours | <8 hours |
The final score is a weighted sum of normalized sub-scores between 0 and 10. The verdict mapping below is a starting point — calibrate the thresholds against your own incident history after 60-90 days of outcome data:
Confidence reflects data completeness. If the experimentation API timed out or the incident tracker rate-limited, confidence drops — and a low-confidence PROCEED is auto-promoted to WATCH on uncertainty alone. Treat every threshold as an initial heuristic. Calibrate against your false-positive and false-negative rates.
Worked example: A payment service deploy on a Thursday at 3 PM. Blast radius: 5 direct dependents → sub-score 5.5 (medium). Seven-day incidents: 1 P2 → sub-score 3.0. On-call load: 1 open → sub-score 4.0. Active A/B tests: 0 overlap → sub-score 1.0. Outstanding P2 bugs: 2 → sub-score 4.5. Time-to-Friday: 26 hours → sub-score 2.0. Weighted sum: (5.5×0.25) + (3.0×0.20) + (4.0×0.15) + (1.0×0.15) + (4.5×0.10) + (2.0×0.15) = 1.375 + 0.60 + 0.60 + 0.15 + 0.45 + 0.30 = 3.475 → PROCEED. Same deploy at 4 PM Friday with 2 open incidents: time-to-Friday → sub-score 8.0, on-call → sub-score 7.5. Weighted sum: 1.375 + 0.60 + 1.125 + 0.15 + 0.45 + 1.20 = 4.90 → WATCH.
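The weighted sum is easy to reproduce, which makes it a useful sanity check when you change a weight. The sketch below uses the weights from the table; the dictionary keys are hypothetical names chosen for this example.

```python
# Weights from the scoring table; they should always sum to 1.0.
WEIGHTS = {
    "blast_radius": 0.25,
    "incident_rate_7d": 0.20,
    "on_call_load": 0.15,
    "ab_experiments": 0.15,
    "open_p2_bugs": 0.10,
    "time_to_friday": 0.15,
}

def weighted_score(sub_scores):
    """Weighted sum of 0-10 sub-scores; refuses to score if any signal is absent."""
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)

thursday = {
    "blast_radius": 5.5,
    "incident_rate_7d": 3.0,
    "on_call_load": 4.0,
    "ab_experiments": 1.0,
    "open_p2_bugs": 4.5,
    "time_to_friday": 2.0,
}
print(round(weighted_score(thursday), 3))  # 3.475
```

Swapping in the Friday sub-scores for on-call load and time-to-Friday re-runs the second scenario with the same function.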
The signal weights and verdict thresholds above are illustrative defaults drawn from common deploy failure patterns. Your infrastructure topology, team size, deploy cadence, and incident distribution will need different calibration. Teams that skip the 60-90 day calibration phase consistently find blast radius is underweighted (push it from 0.25 toward 0.30-0.35) and time-to-Friday is overweighted for teams whose incidents cluster on Mondays or mid-week. Run advisory-only first. Trust the data, not the defaults.
Without a risk score:
Engineers guess at deploy timing from vibes
Blast radius is unknown until something breaks
On-call fatigue is invisible to the person shipping
Friday deploys ride on social pressure, not data
Experiment contamination surfaces weeks later in skewed results
Incident-during-deploy response is reactive scrambling
With a risk score:
Every deploy gets an objective score from the same six signals
Blast radius is quantified from the IaC dependency graph
On-call state is a first-class deploy input, not a manual check
Time-based risk is calculated, not negotiated
Experiment overlap is flagged before the code reaches production
High-risk windows are identified and held before the pager goes off
GitHub Actions, ArgoCD, custom pipelines — the integration is the same shape every time.
The gate runs as a blocking step between "build succeeded" and "deploy to production." It's not a separate process or a Slack bot that posts a warning someone ignores. The verdict is enforced by the pipeline runner: a non-zero exit code from the gate step is the mechanism that reliably stops a deploy in CI.
For ArgoCD, the pattern is a pre-sync hook: a Kubernetes Job that runs the scoring agent against the target application before ArgoCD applies the sync. If the Job exits non-zero, the sync is blocked. For Spinnaker, the equivalent is a Precondition stage with a script condition. For GitHub Actions, the gate runs as its own job that the deploy job depends on — the needs field does the enforcement.
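Whatever the runner, the gate itself reduces to a script whose exit code carries the verdict. The sketch below is one hedged way to shape it: the `--advisory` flag and the function names are hypothetical, WATCH is treated as non-blocking, and an unrecognized verdict fails closed rather than letting the deploy through.

```python
VERDICTS = ("HOLD", "WATCH", "PROCEED")

def gate_exit_code(verdict, enforce=True):
    """Translate a verdict into the exit code the pipeline runner acts on."""
    if verdict not in VERDICTS:
        return 1  # fail closed: a broken scorer must not read as a green light
    if verdict == "HOLD" and enforce:
        return 1  # non-zero fails the gate, so the dependent deploy step never runs
    return 0      # PROCEED passes; WATCH passes but should be surfaced to the team

def main(argv):
    """argv: [verdict] plus an optional, hypothetical --advisory flag."""
    verdict = argv[0] if argv else ""
    enforce = "--advisory" not in argv
    code = gate_exit_code(verdict, enforce)
    print(f"verdict={verdict or 'UNKNOWN'} enforce={enforce} exit={code}")
    return code

# A real entry point would end with: sys.exit(main(sys.argv[1:]))
main(["HOLD"])                # exit=1, deploy blocked
main(["HOLD", "--advisory"])  # exit=0, verdict reported only
```

Running with the advisory flag first gives the team the verdicts without the block, and dropping the flag later turns the same script into an enforcing gate.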
Monorepos, progressive delivery, and degraded signal sources each require specific handling or the score misbehaves.
Monorepos: The single worst mistake is scoring the entire repository on every PR. One team initially evaluated their full monorepo on every commit and produced HOLDs on documentation typo fixes. The blast radius analyzer read thousands of resource dependencies that had nothing to do with the change. Scoping to Nx's affected command — which returns the precise set of projects a commit range touches — dropped their false-positive rate from roughly 40% to under 8%. Bazel and Turborepo expose equivalent affected-graph APIs. The rule: score the deployable unit, not the repository.
Progressive delivery: Feature flags shrink blast radius by limiting exposure. A deploy behind a 1% canary flag has an effective blast radius of 1% of the calculated value — the risk is real but bounded. The scoring agent should query the feature flag platform and apply a rollout multiplier before scoring. The structural point: progressive delivery is itself a deploy safety mechanism; the score should compose with it rather than double-count risk.[7]
Degraded signal sources: APIs time out. Rate limits hit at the worst moments. A scoring system that returns PROCEED on any signal failure is a liability — it trains engineers to trust a verdict that was assembled from incomplete data. The correct behavior: a timed-out or errored signal source gets a neutral-high score of 5 (not 0), and overall confidence drops proportionally. A sub-50% confidence PROCEED auto-promotes to HOLD. Make the degradation visible in the PR comment so engineers understand they're operating with partial information.
Never collapse a missing signal to 0. Absence of data is not evidence of safety.
A half-scored deploy is not a safe deploy. Surface the gap, don't hide it.
A single extreme signal overrides the aggregate. One burning building is enough.
Blast radius is the heaviest signal. If the IaC state is unavailable, the score is invalid.
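The degradation rules for missing signals can be sketched as two small functions. One assumption is labeled here and in the code: confidence is computed as the share of signals that returned live data, which is only one reading of "drops proportionally" (weighting by signal importance would be another). The function names are hypothetical.

```python
NEUTRAL_HIGH = 5.0  # substituted for any signal whose source timed out or errored

def fill_degraded(raw_scores, expected):
    """Replace missing sub-scores with NEUTRAL_HIGH and report confidence.

    raw_scores: {signal_name: sub_score or None}; None means the source failed.
    Confidence is the percentage of expected signals that returned live data
    (an assumption made for this sketch).
    """
    filled, degraded = {}, []
    for name in expected:
        value = raw_scores.get(name)
        if value is None:
            filled[name] = NEUTRAL_HIGH
            degraded.append(name)
        else:
            filled[name] = value
    confidence = 100.0 * (len(expected) - len(degraded)) / len(expected)
    return filled, confidence, degraded

def apply_confidence(verdict, confidence):
    """A PROCEED assembled from less than 50% live data is promoted to HOLD."""
    if verdict == "PROCEED" and confidence < 50.0:
        return "HOLD"
    return verdict

expected = ["blast_radius", "incident_rate_7d", "on_call_load", "ab_experiments"]
raw = {"blast_radius": 2.0, "incident_rate_7d": None, "on_call_load": None,
       "ab_experiments": None}
filled, confidence, degraded = fill_degraded(raw, expected)
print(confidence, apply_confidence("PROCEED", confidence))  # 25.0 HOLD
```

The `degraded` list is what belongs in the PR comment, so engineers can see which sub-scores came from live data and which were filled in.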
A risk scoring system is only as good as its calibration. After every deploy — PROCEED, WATCH, or overridden HOLD — log the outcome. Did the deploy cause an incident inside 24 hours? Was a rollback required? Did any experiment results get invalidated by the change?
Store those outcomes alongside the original scores in a tracking table. After 60-90 days of outcome data, run a logistic regression against the weights and see what is actually predictive. Findings that surface consistently across teams that have done this:
The counterintuitive finding: teams that get the most out of the score are not the ones deploying dozens of times a day. They're the ones deploying monthly or quarterly. High-velocity teams build deploy intuition organically — every push is feedback. Low-velocity teams have no feedback loop, so every deploy is a high-stakes event with no pattern recognition behind it. The risk score replaces the intuition the cadence never built.
| Field | Type | Purpose |
|---|---|---|
| deploy_id | string | Links verdict to the deploy artifact |
| verdict | HOLD / WATCH / PROCEED | The scoring system's output |
| score | float 0-10 | Weighted sum at time of scoring |
| confidence | float 0-100 | Data completeness at scoring time |
| overridden | boolean | Was a HOLD manually overridden? |
| override_justification | string | Required field if overridden |
| incident_within_24h | boolean | Did an incident follow within 24 hours? |
| rollback_required | boolean | Was a rollback executed? |
| experiment_invalidated | boolean | Were A/B results corrupted? |
| signal_scores | JSON | Individual sub-scores for regression input |
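The tracking record maps naturally onto a small typed structure. The sketch below uses a dataclass with the table's fields written in snake_case. The validation rules and the `to_json` helper are assumptions added for illustration, and the storage layer is left out entirely.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DeployOutcome:
    deploy_id: str
    verdict: str                      # HOLD, WATCH, or PROCEED
    score: float                      # weighted sum, 0-10
    confidence: float                 # data completeness, 0-100
    overridden: bool = False
    override_justification: str = ""
    incident_within_24h: bool = False
    rollback_required: bool = False
    experiment_invalidated: bool = False
    signal_scores: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.verdict not in {"HOLD", "WATCH", "PROCEED"}:
            raise ValueError(f"unknown verdict: {self.verdict}")
        if self.overridden and not self.override_justification.strip():
            raise ValueError("override_justification is required when overridden")

    def to_json(self):
        """Serialize the record for whatever store holds the outcome log."""
        return json.dumps(asdict(self))

record = DeployOutcome("deploy-001", "PROCEED", 3.475, 100.0,
                       signal_scores={"blast_radius": 5.5})
print(json.loads(record.to_json())["verdict"])  # PROCEED
```

Keeping the individual sub-scores on every record is what makes the later regression possible, since the weighted sum alone cannot tell you which signal was predictive.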
What if one of the signal sources is unavailable when scoring runs?
Degrade gracefully with a neutral-high score. A timed-out or errored signal source gets a score of 5 — not 0, which would falsely depress the verdict. Overall confidence drops proportionally, and a sub-50% confidence PROCEED auto-promotes to HOLD. Never let a missing signal collapse to an unconditional PROCEED — that turns a degraded system into a green light, which is exactly the failure mode the score exists to prevent. The PR comment should always disclose which signals scored from live data and which are estimated.
How do you handle monorepo deploys where everything looks like it changes at once?
Scope the blast radius analysis to the build targets that actually changed, not the entire repository. Nx's affected command, Bazel's query --universe_scope, and Turborepo's --filter flag all return the set of projects a commit range touches, and that set maps directly into blast radius analysis. One team scored their entire monorepo on every PR and produced HOLDs on documentation typos. Scoping to affected targets dropped false positives from roughly 40% to under 8%. The rule: evaluate per deployable unit, not per repository.
Should the score block deploys automatically or just advise?
Advisory-only for the first 30 days. Let the team see the verdicts, argue with them, and build calibration intuition against their own deploys. Once the false-positive rate drops below 10%, switch HOLDs to blocking with two-person override. WATCH stays advisory. Going to enforcement before the false-positive rate is under control trains engineers to override on reflex — which is worse than not having the gate at all. The enforcement escalation is: advisory → blocking-with-override → blocking-with-review-gate.
How does the score interact with feature flags and progressive delivery?
Feature flags shrink blast radius by limiting exposure, and the score should reflect that. A deploy shipping behind a 1% canary flag has an effective blast radius of 1% of the calculated value. The agent should query the feature flag platform and apply a rollout percentage multiplier to the blast radius sub-score. Progressive delivery is itself a deploy safety mechanism — the risk score composes with it rather than double-counting the underlying change risk.
When should blast radius be weighted higher than the default 0.25?
Most teams end up pushing blast radius toward 0.30-0.35 after calibration. Move it higher if your incidents tend to originate from shared infrastructure changes (security groups, IAM, VPCs, shared config) rather than application code. Move it lower if your services are genuinely isolated — bounded contexts with no cross-service dependencies — though that topology is rarer than teams believe. Overmind's real-world data shows runtime dependencies typically run 3-5x wider than the dependency tree teams maintain mentally.
What's the fastest path to a working first version?
Start with two signals: blast radius from your IaC state and 7-day incident rate from your ITSM tool. Those two signals catch the majority of contextual deploy failures on their own. Wire them into a GitHub Actions gate in advisory mode. Add on-call load after 30 days. Add the remaining three signals once you have outcome data to calibrate against. The full six-signal system is the goal, but a two-signal advisory gate running Monday is more valuable than a perfect six-signal enforcer that ships in Q3.
The pre-deploy risk score is not a brake on velocity. Teams that implement it consistently report higher deploy frequency — engineers ship more often when they trust the system to flag bad timing. The score swaps anxiety for a measurement, and trades the post-incident "we should have known" for the pre-incident "the gate caught it."
Start with blast radius and incident rate. Those two signals catch most contextual deploy failures.[4] Add the rest as the API integrations land. Within 90 days of outcome calibration, the score reflects the specific shape of your infrastructure and the team that lives inside it.
The code was never the problem. The conditions were. Score them.