Pre-Deploy Risk Score: The Gate That Reads Context

Q: What if one of the signal sources is unavailable when scoring runs?

Degrade gracefully. A timed-out or errored signal source gets a score of 5 (neutral-high) and the overall confidence drops proportionally. A low-confidence PROCEED auto-promotes to WATCH. Never let a missing signal collapse to an unconditional PROCEED — that turns a degraded system into a green light, which is exactly the failure mode the score exists to prevent.
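
A minimal sketch of that fallback, assuming each collector hands back either a 0-10 sub-score or null when it timed out or errored. The names `SignalResult` and `resolveSignals` are hypothetical, and defining confidence as the share of total weight backed by live data is one reasonable reading of "drops proportionally", not a confirmed formula.

```typescript
// Hypothetical shape: one entry per signal collector.
interface SignalResult {
  name: string;
  weight: number;        // weights are assumed to sum to 1.0
  score: number | null;  // null = collector timed out or errored
}

const FALLBACK_SCORE = 5; // the neutral-high default for a missing signal

function resolveSignals(results: SignalResult[]): {
  scores: Record<string, number>;
  confidence: number;    // 0-100, share of weight backed by real data
  missing: string[];
} {
  const scores: Record<string, number> = {};
  const missing: string[] = [];
  let totalWeight = 0;
  let liveWeight = 0;

  for (const result of results) {
    totalWeight += result.weight;
    if (result.score === null) {
      // Missing data never counts as "safe": substitute the fallback score.
      scores[result.name] = FALLBACK_SCORE;
      missing.push(result.name);
    } else {
      scores[result.name] = result.score;
      liveWeight += result.weight;
    }
  }

  const confidence =
    totalWeight === 0 ? 0 : Math.round((liveWeight / totalWeight) * 100);
  return { scores, confidence, missing };
}
```

Returning the list of missing signals alongside the confidence lets the verdict breakdown say which source was down, so a WATCH caused by an outage is distinguishable from a WATCH caused by real risk.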

Q: How do you handle monorepo deploys where everything looks like it changes at once?

Scope the blast radius analysis to the build targets that actually changed, not the entire repository. Bazel, Nx, and Turborepo expose affected-project graphs that map directly into dependency analysis — Nx's `affected` command, for one, returns the precise set of projects the commit range touches. The score should evaluate per-deployable-unit, not per-commit. One team initially scored their entire monorepo on every PR and produced HOLDs on documentation typos. Scoping to affected targets dropped false positives from roughly 40% to under 8%.
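
The scoping step can be sketched as a reverse walk over a project graph. This is an illustration only: in practice the affected set would come from the build tool itself, and the `ProjectGraph` shape, the `affectedUnits` name, and the sample unit names are all assumptions.

```typescript
// Hypothetical project graph: each deployable unit lists the units it depends on.
type ProjectGraph = Record<string, string[]>;

function affectedUnits(graph: ProjectGraph, changedUnits: string[]): string[] {
  // Invert the edges so we can walk from a unit to everything that depends on it.
  const dependents = new Map<string, string[]>();
  for (const unit of Object.keys(graph)) {
    for (const dep of graph[unit]) {
      const list = dependents.get(dep) || [];
      list.push(unit);
      dependents.set(dep, list);
    }
  }

  // Seed with the units that changed directly. Anything that is not a
  // deployable unit (for example a docs-only change) is ignored.
  const affected = new Set<string>();
  const stack: string[] = [];
  for (const unit of changedUnits) {
    if (unit in graph && !affected.has(unit)) {
      affected.add(unit);
      stack.push(unit);
    }
  }

  // Walk outward to every unit that transitively depends on a changed one.
  while (stack.length > 0) {
    const current = stack.pop() as string;
    for (const parent of dependents.get(current) || []) {
      if (!affected.has(parent)) {
        affected.add(parent);
        stack.push(parent);
      }
    }
  }
  return Array.from(affected).sort();
}

const sampleGraph: ProjectGraph = {
  "checkout-api": ["shared-auth"],
  "search-api": [],
  "shared-auth": [],
};
```

With this sample, a change to `shared-auth` scores two units, while a change that maps to no unit returns an empty list and there is nothing to score.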

Q: Should the score block deploys automatically or just advise?

Advisory-only for the first 30 days. Let the team see the verdicts, argue with them, and build calibration intuition against their own deploys. Once the false-positive rate drops below 10%, switch HOLDs to blocking with two-person override. WATCH stays advisory. Going to enforcement before the false-positive rate is under control trains engineers to override on reflex, which is worse than not having the gate at all.

Q: How does the score interact with feature flags and progressive delivery?

Feature flags shrink blast radius by limiting exposure, and the score should reflect that. If a deploy is shipping behind a flag at 1% rollout, the effective blast radius is 1% of the calculated value. The agent should query the feature flag platform to apply the multiplier. The structural point: progressive delivery is itself a deploy safety mechanism, and the score should compose with it instead of double-counting risk.
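
As a rough illustration of that multiplier, assuming the flag platform reports a rollout percentage between 0 and 100. The function name is hypothetical, and treating an unknown rollout as full exposure is an assumption chosen so that a failed flag lookup never lowers the score.

```typescript
function effectiveBlastRadius(
  calculated: number,
  rolloutPercent: number | null
): number {
  // No flag, or flag state unknown: assume full exposure rather than guessing low.
  if (rolloutPercent === null) return calculated;

  // Clamp bad input into the 0-100 range before scaling.
  const clamped = Math.min(100, Math.max(0, rolloutPercent));
  return Math.round(calculated * (clamped / 100) * 100) / 100;
}
```

A calculated blast radius of 8 behind a 50% rollout becomes 4, and the same deploy with no flag information stays at 8.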

The Code Was Fine. The Timing Was the Incident.

Most production deploys that break did not break because of bad code. They broke because of context the deployer could not see. A pre-deploy risk score replaces gut feel with six measurable signals and a HOLD/PROCEED/WATCH verdict the pipeline enforces.

Governance & Adoption · intermediate · Dec 1, 2025 · 5 min read

By Viktor Bezdek · VP Engineering, Groupon

Every team has the same incident in its postmortem archive. The Friday push that cascaded across three services. The release that landed while two P1s were already burning. The config tweak nobody realized was wired into fourteen microservices through a shared dependency.

None of those failures were code failures. They were context failures. CI was green. The diff looked surgical. The system around the deploy was the part that broke.

The pre-deploy risk score closes that gap. The agent fires the moment a deploy enters the queue, pulls signals across infrastructure and team state, and returns one of three verdicts — HOLD, PROCEED, or WATCH — with a numerical confidence and a plain-English breakdown of which signals moved the needle. Gut feel becomes a measurement. Tribal knowledge becomes a check the pipeline enforces.

Green CI Is Not the Same Thing as a Safe Deploy

Mature systems rarely ship broken code. They ship correct code into a state that cannot absorb it.

Most deployment failures in mature systems are not bad-code failures. They are context collisions — a technically valid change meeting an environment that was not ready for it. A migration that runs cleanly in staging, then deadlocks against an active A/B test that doubled write traffic on the affected table. A flag rollout that touches the same API surface as the incident remediation an on-call engineer started forty minutes earlier.

A 2025 Overmind analysis found that a substantial share of deploy-related incidents involved infrastructure dependencies the deploying engineer did not know existed^[1] — with estimates that more than half of contextual failures are predictable given better dependency visibility.^[4] The artifact was sound. The conditions around it were not. The pre-deploy risk score scores the deployment context, not the artifact.

7-day: Incident rate window
A/B tests: Active experiment overlap
Blast radius: Dependency graph reach
Time-to-Friday: Staffing-window penalty
P2 bugs: Outstanding instability count
On-call load: Responder fatigue index

Six Signals. Each One Measures a Different Failure Class.

Every signal is independently sourced and independently weighted. Drop any one and the verdict moves.

[01]
Active A/B Experiments
The agent queries the experimentation platform — LaunchDarkly, Split, Statsig — for every running experiment that touches the same services or routes the deploy modifies. Overlap is the failure mode. Deploy-induced variance corrupts experiment results, and experiment-induced traffic skew amplifies whatever side effects the deploy carries. Both directions burn weeks of analysis and a quarter of the experimentation roadmap.
[02]
On-Call Engineer Incident Load
A deploy landing on top of an on-call already managing two open incidents means slower response when something else breaks. The agent reads PagerDuty or Opsgenie for the current rotation and pulls open and recently-resolved incident counts over the last 48 hours. High fatigue forces a WATCH or HOLD. Responder bandwidth is a deployment signal, not a soft factor.
[03]
Blast Radius from the Dependency Graph
The heaviest signal. The agent parses Terraform state, CloudFormation stacks, or Kubernetes manifests into a runtime dependency graph, then traces which downstream services, databases, and queues the modified resources actually reach. A change to a shared VPC security group has nothing in common with a change to one Lambda, and the score has to reflect that.^[2]
[04]
Seven-Day Incident Rate
If the target service has been unstable in the past week, layering more change onto it compounds the instability. The agent pulls incident history from the ITSM tool, weights by severity, then applies a recency decay. A service with two P2s in the last 72 hours scores radically differently from one that has been quiet for a quarter.
[05]
Outstanding P2 Bugs
Open high-priority bugs are evidence that the codebase is already carrying instability nobody has paid down. The agent pulls Jira or Linear for open P2s tagged against the affected services. Each unresolved bug is a place where additional change is more likely to reveal something nobody scoped.
[06]
Time-to-Friday and Calendar Pressure
Deploying at 4:47 PM on a Friday before a holiday weekend is structurally different from deploying Tuesday at 10 AM. Not because Friday is cursed, but because the responder pool shrinks, the on-call engineer is one person, and any incident that does land has to survive the weekend on whoever happens to be carrying the pager. The agent calculates hours to end-of-business Friday, checks the company holiday calendar, and factors in the geographic spread of the on-call team. Staffing reality, not superstition.
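
The calendar signal is the easiest of the six to sketch. The version below assumes a 17:00 local close of business, treats the weekend and Friday evening as already inside the thin-staffing window, and maps hours onto representative sub-scores for the Low, Medium, and High bands. The function names, the band values (2, 5, 8), and the +2 holiday bump are all illustrative assumptions, and on-call geography is left out.

```typescript
const FRIDAY = 5;
const END_OF_BUSINESS_HOUR = 17; // assumption: business day ends at 17:00 local

function hoursToFridayClose(now: Date): number {
  const day = now.getDay(); // 0 = Sunday ... 6 = Saturday
  const hour = now.getHours() + now.getMinutes() / 60;

  // Friday after close and the weekend are already the low-staffing window.
  const afterFridayClose = day === FRIDAY && hour >= END_OF_BUSINESS_HOUR;
  const weekend = day === 6 || day === 0;
  if (afterFridayClose || weekend) return 0;

  const daysAhead = FRIDAY - day; // Monday (1) through Friday (5) gives 4..0
  return daysAhead * 24 + (END_OF_BUSINESS_HOUR - hour);
}

function timeToFridayScore(now: Date, beforeHoliday = false): number {
  const hours = hoursToFridayClose(now);
  let score: number;
  if (hours < 8) score = 8;        // High band
  else if (hours <= 24) score = 5; // Medium band
  else score = 2;                  // Low band

  // A holiday right after the window makes the same timing riskier.
  return beforeHoliday ? Math.min(10, score + 2) : score;
}
```

Working from day and hour fields instead of millisecond differences keeps the result stable across daylight-saving changes, at the cost of assuming the `Date` is already in the team's primary timezone.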

Pre-Deploy Risk Score — Signal Fan-Out, Verdict Fan-In

The deploy queue event fans out to six signal collectors in parallel. The aggregator weights them, and the verdict node routes to HOLD, WATCH, or PROCEED. A missing signal lowers confidence and can promote the verdict to WATCH; it never defaults to PROCEED.

Building the Blast Radius Estimator From IaC State

Pull the dependency graph from Terraform and CloudFormation. Score what the change actually reaches.

Blast radius is the heaviest signal and the most technically involved component of the score. It earns its own treatment. The mechanism is straightforward: parse Infrastructure-as-Code state into a directed acyclic graph of resource dependencies, then count how many nodes are reachable from the set of resources the deploy modifies.

For Terraform, the entry point is terraform graph or parsing the state file directly. The open-source blast-radius project pioneered interactive visualization of these dependency graphs in d3.js.^[2] For production scoring you do not need the visualization — pipe terraform graph -type=plan into a parser that extracts nodes and edges, then run a breadth-first traversal from the changed resources outward.^[3]
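
A minimal parsing sketch, assuming the DOT text from `terraform graph` has already been captured as a string. In that output an edge `A -> B` means A depends on B, so the parser inverts each edge to get "who depends on this resource". Node label formats vary between Terraform versions, so the `[root]` prefix and `(expand)` suffix handling here is an assumption to check against your own output, and `parseDependents` and `cleanLabel` are hypothetical names.

```typescript
// Strip decoration that some Terraform versions add around resource addresses.
function cleanLabel(label: string): string {
  return label
    .replace(/^\[root\]\s*/, "")
    .replace(/\s*\((expand|close)\)$/, "");
}

// Returns a map from a resource to the resources that depend on it directly.
function parseDependents(dot: string): Map<string, string[]> {
  const dependents = new Map<string, string[]>();
  const edgePattern = /"([^"]+)"\s*->\s*"([^"]+)"/;

  for (const line of dot.split("\n")) {
    const match = edgePattern.exec(line);
    if (match === null) continue;

    // Skip provider edges; only resource-to-resource edges matter here.
    if (line.indexOf("provider[") !== -1) continue;

    const dependent = cleanLabel(match[1]);
    const dependency = cleanLabel(match[2]);

    // Invert the edge: record the dependent under the thing it depends on.
    const list = dependents.get(dependency) || [];
    if (list.indexOf(dependent) === -1) list.push(dependent);
    dependents.set(dependency, list);
  }
  return dependents;
}
```

The resulting map has the same shape a scorer needs for its direct-dependents lookup, so a breadth-first traversal can start from the changed resource addresses without any further transformation.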

CloudFormation behaves differently. Stacks expose DependsOn relationships explicitly, and the AWS API returns the stack's resource list on demand. The trap is cross-stack references: when one stack exports a value another imports, the dependency is implicit but the blast radius is real. The parser has to follow Fn::ImportValue references across stack boundaries, or the score under-reads coupling that production absolutely respects.^[5]

The scoring formula weights direct dependents more heavily than transitive ones, with a decay factor at each hop. A change that hits 3 services directly and reaches 9 more transitively scores differently from one that hits 12 services directly with no transitive reach. Same node count, very different blast radius.

blast-radius-scorer.ts

interface DependencyNode {
  resourceId: string;
  resourceType: string;
  directDependents: string[];
  transitiveDependents: string[];
}

function calculateBlastRadius(
  changedResources: string[],
  graph: Map<string, DependencyNode>
): { score: number; affectedServices: string[]; depth: number } {
  const visited = new Set<string>();
  const queue: { id: string; depth: number }[] = [];
  let maxDepth = 0;

  // Seed the BFS with the resources the deploy actually touches
  for (const id of changedResources) {
    queue.push({ id, depth: 0 });
    visited.add(id);
  }

  // Walk outward through the dependency graph
  let score = 0;
  while (queue.length > 0) {
    const { id, depth } = queue.shift()!;
    const node = graph.get(id);
    if (!node) continue;

    // Decay: a changed resource counts 1.0, and each hop outward multiplies by 0.6
    const depthWeight = Math.pow(0.6, depth);
    score += depthWeight;
    maxDepth = Math.max(maxDepth, depth);

    for (const dep of node.directDependents) {
      if (!visited.has(dep)) {
        visited.add(dep);
        queue.push({ id: dep, depth: depth + 1 });
      }
    }
  }

  return {
    score: Math.round(score * 100) / 100,
    affectedServices: [...visited],
    depth: maxDepth,
  };
}

Six Sub-Scores, One Verdict, Zero Ambiguity

How the weighted sum collapses to HOLD, WATCH, or PROCEED — and where the thresholds bite.

Signal	Weight	Low (0-3)	Medium (4-6)	High (7-10)
Blast Radius	0.25	≤2 direct deps	3-8 direct deps	>8 or cross-region
7-Day Incident Rate	0.20	0-1 incidents	2-3 incidents	4+ or any P1
On-Call Load	0.15	0 open incidents	1-2 open incidents	3+ or recent P1
A/B Experiments	0.15	0 overlapping	1-2 overlapping	3+ overlapping
Outstanding P2 Bugs	0.10	0-1 open	2-4 open	5+ open
Time-to-Friday	0.15	>24 hours	8-24 hours	<8 hours

The final score is a weighted sum of normalized sub-scores between 0 and 10. The verdict mapping below is a starting point — calibrate the thresholds against your own incident history after 60-90 days of outcome data:

PROCEED (score 0.0 – 3.9, confidence ≥ 70%): All signals inside acceptable ranges. The deploy goes with standard monitoring.
WATCH (score 4.0 – 6.4, or confidence 50–69%): Elevated risk on at least one axis. The deploy goes, but the agent shortens canary windows, tightens rollback thresholds, and posts a Slack alert into the on-call channel.
HOLD (score 6.5 – 10.0, or any single signal at 9+): The agent halts the pipeline and pages the deploy author with which signals fired. A manual override requires two approvals, both logged.

Confidence reflects data completeness. If the experimentation API timed out or the incident tracker rate-limited, confidence drops — and a low-confidence PROCEED is auto-promoted to WATCH on uncertainty alone. Treat every threshold as an initial heuristic. Calibrate against your false-positive and false-negative rates.
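
The weighted sum and verdict mapping can be sketched as follows, using the weights from the table and the thresholds listed above. The signal keys and the `decide` function are hypothetical names. The source does not say what happens below 50% confidence, so this sketch treats any confidence under 70% as WATCH, which is an assumption.

```typescript
type Verdict = "PROCEED" | "WATCH" | "HOLD";

// Weights from the table; sub-scores are expected on the 0-10 scale.
const WEIGHTS: Record<string, number> = {
  blastRadius: 0.25,
  incidentRate: 0.2,
  onCallLoad: 0.15,
  abExperiments: 0.15,
  p2Bugs: 0.1,
  timeToFriday: 0.15,
};

function decide(
  subScores: Record<string, number>,
  confidence: number // 0-100
): { score: number; verdict: Verdict } {
  let score = 0;
  let maxSignal = 0;
  for (const name of Object.keys(WEIGHTS)) {
    const subScore = subScores[name];
    if (subScore === undefined) throw new Error("missing sub-score: " + name);
    score += WEIGHTS[name] * subScore;
    maxSignal = Math.max(maxSignal, subScore);
  }
  // One decimal place, so no score can fall between 3.9 and 4.0.
  score = Math.round(score * 10) / 10;

  let verdict: Verdict;
  if (score >= 6.5 || maxSignal >= 9) {
    verdict = "HOLD"; // high total, or any single signal at 9+
  } else if (score >= 4.0 || confidence < 70) {
    verdict = "WATCH"; // elevated risk, or a PROCEED promoted on uncertainty
  } else {
    verdict = "PROCEED";
  }
  return { score, verdict };
}
```

Checking HOLD first means low confidence can only ever raise a verdict, never soften one: a HOLD with missing data stays a HOLD.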

These weights are starting points. Calibration is non-optional.

The signal weights and verdict thresholds above are illustrative defaults drawn from common deploy failure patterns. Your infrastructure topology, team size, deploy cadence, and incident distribution will need different calibration. Teams that complete the 60-90 day calibration phase consistently find blast radius is underweighted (push it from 0.25 toward 0.30-0.35) and time-to-Friday is overweighted for teams with non-Friday incident patterns. Run advisory-only first. Trust the data, not the defaults.

Gut Feel

Engineers guess at deploy timing from vibes
Blast radius is unknown until something breaks
On-call fatigue is invisible to the person shipping
Friday deploys ride on social pressure, not data
Experiment contamination surfaces weeks later in skewed results
Incident-during-deploy response is reactive scrambling

Measurement

Every deploy gets an objective score from the same six signals
Blast radius is quantified from the IaC dependency graph
On-call state is a first-class deploy input, not a manual check
Time-based risk is calculated, not negotiated
Experiment overlap is flagged before the code reaches production
High-risk windows are identified and held before the pager goes off

Wiring the Score Into the Pipeline That Already Ships

GitHub Actions, ArgoCD, custom pipelines — the integration is the same shape every time.

.github/workflows/deploy-gate.yml

# Pre-deploy risk gate. Runs before any production deploy reaches the cluster.
name: Pre-Deploy Risk Gate
on:
  deployment:
    types: [created]

jobs:
  risk-score:
    runs-on: ubuntu-latest
    outputs:
      verdict: ${{ steps.score.outputs.verdict }}
      confidence: ${{ steps.score.outputs.confidence }}
    steps:
      - uses: actions/checkout@v4

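      - name: Gather deploy context
        id: context
        env:
          GH_TOKEN: ${{ github.token }}  # the gh CLI needs a token in Actions
          REPO: ${{ github.repository }}
          # PR must also be exported here; how you look it up depends on your pipeline
        run: |
          # Join filenames with commas: a multi-line value would corrupt $GITHUB_OUTPUT
          files=$(gh api "repos/$REPO/pulls/$PR/files" | jq -r '.[].filename' | sort -u | paste -sd, -)
          echo "changed_services=$files" >> "$GITHUB_OUTPUT"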

      - name: Score the deploy
        id: score
        run: |
          npx deploy-risk-agent \
            --services "${{ steps.context.outputs.changed_services }}" \
            --pagerduty-token "${{ secrets.PD_TOKEN }}" \
            --launchdarkly-token "${{ secrets.LD_TOKEN }}" \
            --terraform-state "s3://infra-state/prod" \
            --jira-project "ENG" \
            --output json > risk-report.json

          echo "verdict=$(jq -r .verdict risk-report.json)" >> $GITHUB_OUTPUT
          echo "confidence=$(jq -r .confidence risk-report.json)" >> $GITHUB_OUTPUT

      - name: Enforce verdict
        if: steps.score.outputs.verdict == 'HOLD'
        run: |
          echo "::error::Deploy HELD — risk score exceeded the threshold"
          exit 1

Calibration Is the Job. The Initial Score Is the Hypothesis.

A risk scoring system is only as good as its calibration. After every deploy — PROCEED, WATCH, or overridden HOLD — log the outcome. Did the deploy cause an incident inside 24 hours? Was a rollback required? Did any experiment results get invalidated by the change?

Store those outcomes alongside the original scores in a tracking table. After 60-90 days of outcome data, run a logistic regression against the weights and see what is actually predictive. Findings that show up over and over from teams that have done this:

Blast radius is almost always underweighted at the start. Most teams end up moving it from 0.25 to 0.30-0.35.^[4]
Time-to-Friday is overweighted for teams whose incidents cluster on Mondays or mid-week. Tune the weight against your actual day-of-week incident distribution, not the cultural assumption.
On-call load gets sharper when the input includes the responder's recent activity timestamps — a proxy for sleep — and not just open incident count.
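
Before any regression, the outcome log can answer a simpler question: how often was a HOLD wrong? A minimal sketch, assuming one record per scored deploy. The `DeployOutcome` shape and the definition of a false positive (a HOLD that shipped anyway with no incident and no rollback inside 24 hours) are assumptions.

```typescript
// Hypothetical row from the outcome tracking table.
interface DeployOutcome {
  verdict: "PROCEED" | "WATCH" | "HOLD";
  shipped: boolean;           // false when a blocking HOLD was respected
  incidentWithin24h: boolean;
  rolledBack: boolean;
}

// Share of shipped HOLDs that turned out clean, or null with no data.
function holdFalsePositiveRate(outcomes: DeployOutcome[]): number | null {
  // Only HOLDs that shipped (advisory mode or override) have an observable outcome.
  const shippedHolds = outcomes.filter(
    (o) => o.verdict === "HOLD" && o.shipped
  );
  if (shippedHolds.length === 0) return null;

  const clean = shippedHolds.filter(
    (o) => !o.incidentWithin24h && !o.rolledBack
  );
  return clean.length / shippedHolds.length;
}
```

In advisory-only mode every HOLD ships, so this number is measurable from day one. Once HOLDs block, only overridden HOLDs contribute, and the sample shrinks and skews toward deploys someone was confident enough to force through.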

Here is the counterintuitive part. The teams that get the most out of the score are not the ones deploying dozens of times a day. They are the ones deploying monthly or quarterly. High-velocity teams build deploy intuition organically — every push is feedback. Low-velocity teams have no feedback loop, so every deploy is a high-stakes event with no pattern recognition behind it. The risk score replaces the intuition the cadence never built.

Pre-Deploy Risk Score Implementation Checklist

Webhook listener wired to the deploy queue event — not polled, not batched
Experimentation platform API integrated (LaunchDarkly, Split, or Statsig)
PagerDuty or Opsgenie connected for live on-call state
Terraform or CloudFormation dependency graph parser landed and tested against staging
Incident history query against the ITSM tool — severity weighted, recency decayed
Issue tracker connected for live P2 bug counts on affected services
Time-to-Friday calculator with company holiday awareness, primary timezone configured
Initial signal weights and verdict thresholds defined as a code-reviewed config file, not hard-coded constants
CI/CD gate landed with a documented two-person override path for HOLDs
Outcome tracking table created — every verdict joined to the next 24 hours of incidents and rollbacks
Monthly weight recalibration review on the team calendar with a named owner
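
For the incident history item in the checklist, "severity weighted, recency decayed" could look like the sketch below. The severity weights, the 48-hour half-life, and the rule that any P1 inside the window floors the sub-score at 7 (to match the High band in the scoring table) are illustrative assumptions to tune against your own data.

```typescript
interface Incident {
  severity: "P1" | "P2" | "P3";
  hoursAgo: number; // hours since the incident opened
}

// Illustrative values, not calibrated defaults.
const SEVERITY_WEIGHT: Record<Incident["severity"], number> = { P1: 7, P2: 2, P3: 1 };
const HALF_LIFE_HOURS = 48;
const WINDOW_HOURS = 7 * 24;

function incidentRateScore(incidents: Incident[]): number {
  let raw = 0;
  let sawP1 = false;

  for (const incident of incidents) {
    // Ignore anything outside the seven-day window.
    if (incident.hoursAgo < 0 || incident.hoursAgo > WINDOW_HOURS) continue;
    if (incident.severity === "P1") sawP1 = true;

    // Exponential decay: an incident counts half as much every HALF_LIFE_HOURS.
    const decay = Math.pow(0.5, incident.hoursAgo / HALF_LIFE_HOURS);
    raw += SEVERITY_WEIGHT[incident.severity] * decay;
  }

  // Clamp onto the 0-10 sub-score scale, then apply the P1 floor.
  let score = Math.min(10, Math.round(raw * 10) / 10);
  if (sawP1) score = Math.max(score, 7);
  return score;
}
```

With these values, two fresh P2s land at 4 (Medium), the same two P2s from two days ago land at 2 (Low), and a single P1 anywhere in the window keeps the signal in the High band.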

The pre-deploy risk score is not a brake on velocity. Teams that ship it consistently report higher deploy frequency, because engineers deploy more often when they trust the system to flag bad timing. The score swaps anxiety for a measurement, and trades the post-incident "we should have known" for the pre-incident "the gate caught it."

Start with blast radius and incident rate. Those two signals catch most of the contextual deploy failures by themselves.^[4] Add the rest as the API integrations land. Within 90 days of outcome calibration, the score is tuned to the specific shape of the infrastructure and the team that lives inside it.

The code was never the problem. The conditions were. Score them, or keep paying for them.

Key terms in this piece

pre-deploy risk score, deployment risk assessment, blast radius analysis, CI/CD safety, deploy gate, infrastructure dependency graph, on-call fatigue, deployment automation

Sources

[1] The Register. Overmind: The Tool That Maps Your Infrastructure's Blast Radius Before You Break It (theregister.com)
[2] 28mm. blast-radius: Interactive visualizations of Terraform dependency graphs (github.com)
[3] IBM. Blast Radius: Review the Impact of Changes in Your Terraform Files (ibm.com)
[4] Overmind. The Difference Between Terraform Plan and Overmind Blast Radius (overmind.tech)
[5] Firefly. Terraform Module Blast Radius: Methods for Resilient IaC in Platform Engineering (firefly.ai)
