Four signal layers, scored monthly per service, produce a fragility register that names your next outage weeks before it happens. Size is not risk. Neglect is risk. The heat map measures neglect.
Why complexity metrics fail to predict outages — and what does predict them
Four signal layers: incident history, codebase health, ADR audit, PR friction
Exact scoring formulas and weight defaults you can run Monday morning
How to build the fragility register and enforce it in leadership cadences
Calibration: how to tune weights after 90 days of your own data
Common failure modes — including the gaming problem — and how to handle them
Four signal layers — incident history, codebase health, ADR audit gaps, PR review friction — scored monthly per service. The output is a ranked fragility register that names your next outage weeks before it happens. The riskiest service is almost never the largest one. It is usually the one nobody has touched.
Most engineering orgs find their fragile services through a P1. The knowledge already existed — scattered across incident channels, abandoned ADRs, the instincts of senior engineers who have been burned before. Nobody synthesizes it. The heat map is the synthesis.
The first time we ran this scoring on a mid-size platform, the top-fragility service was a 4,000-line payments adapter nobody had touched in eight months. Zero recent incidents. Looked clean. The signals said otherwise: 9 TODOs per KLOC pointing at a deprecated API, 18% coverage on recently modified files, PR review cycles averaging 7 rounds. Three months later it caused a two-hour outage during peak traffic. The heat map had flagged it as critical for six weeks before that.
Large codebases are not inherently fragile. Small services with accumulated neglect are.
Engineering teams confuse size with risk. The 200,000-line monolith gets the attention. Meanwhile a 3,000-line billing service with zero tests, four unresolved TODOs pointing at a deprecated API, and a dependency three majors behind quietly accumulates blast radius.
Meta's Diff Risk Score research, published August 2025[1], showed predictive models trained on historical incident correlation outperform models trained on code complexity metrics. Lines of code do not predict breakage. Cyclomatic complexity does not predict breakage. Patterns of neglect, uncertainty, and unresolved decisions predict breakage.
The research on source code hotspots confirms this: studies presented at the 2026 Mining Software Repositories conference found that critical hotspots — fewer than 5% of a codebase — account for more than 50% of recorded defects[5]. The signal is concentrated. You do not need to fix everything; you need to find where the concentration is.
The heat map captures exactly those signals. It asks one question: which services have the highest concentration of deferred maintenance, unreviewed architectural bets, and review-cycle friction? Those are the services that will surprise you next.
Incident history is the strongest single predictor. Weight it accordingly.
Pull 90 days of incident data from your incident management platform, such as PagerDuty, Opsgenie, Rootly, or FireHydrant. Aggregate by affected service. Raw counts mislead: a service with ten P4 informational alerts is not more fragile than one with two P1 outages.
Weight the score by severity and recency. Base points per incident: P1 is 10, P2 is 5, P3 is 2, P4 is 1. Halve the points for each full 30-day period of age: an incident 30 to 60 days old counts half as much as one from last week, and one 60 to 90 days old counts a quarter as much. The result is a recency-weighted severity score that captures both the magnitude and the trajectory.
This layer also tags incident adjacency. When Service A goes down because Service B failed upstream, both services accumulate score — but with different tags: "origin" versus "blast-radius victim." That distinction matters when you decide where to invest remediation. Hardening a victim does not stop the origin from breaking again.
One pattern worth flagging: services with zero recent incidents but high scores on layers 2-4 are the most dangerous category. They have not broken yet. That does not mean they are stable — it often means they are under-monitored. When this pattern appears, increase the weight of codebase health and ADR signals by 1.5x for that service until alerting improves.
| Severity | Base Score | 0–30 Days | 30–60 Days | 60–90 Days |
|---|---|---|---|---|
| P1 — Full outage | 10 | 10.0 | 5.0 | 2.5 |
| P2 — Degraded service | 5 | 5.0 | 2.5 | 1.25 |
| P3 — Minor impact | 2 | 2.0 | 1.0 | 0.5 |
| P4 — Informational | 1 | 1.0 | 0.5 | 0.25 |
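The decay table can be sketched in a few lines of Python. This is a minimal illustration, not the article's actual implementation: the `Incident` record, its field names, and the stepped (per-bucket) decay are assumptions chosen to reproduce the table values, and the output is a raw score that still needs normalizing to 0–100 before it enters the composite.

```python
from dataclasses import dataclass

# Base points per severity, taken from the table above.
BASE_POINTS = {"P1": 10.0, "P2": 5.0, "P3": 2.0, "P4": 1.0}


@dataclass
class Incident:
    # Hypothetical record shape; map your incident platform's export onto it.
    service: str
    severity: str          # "P1" .. "P4"
    age_days: int          # days since the incident, 0 = today
    role: str = "origin"   # "origin" or "victim" (blast-radius adjacency tag)


def decayed_points(incident: Incident, window_days: int = 90) -> float:
    """Base points halved once per full 30-day period of age."""
    if incident.age_days < 0 or incident.age_days >= window_days:
        return 0.0  # outside the 90-day window
    periods = incident.age_days // 30
    return BASE_POINTS[incident.severity] * (0.5 ** periods)


def incident_scores(incidents: list[Incident]) -> dict[str, dict[str, float]]:
    """Sum decayed points per service, kept separate by origin/victim tag."""
    totals: dict[str, dict[str, float]] = {}
    for inc in incidents:
        per_service = totals.setdefault(inc.service, {"origin": 0.0, "victim": 0.0})
        per_service[inc.role] += decayed_points(inc)
    return totals
```

Keeping origin and victim totals separate preserves the adjacency distinction: both services accumulate score, but the register can still show which one to harden first.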
Static signals extracted from the repo. TODOs, coverage gaps, dependency drift — each one a compounding liability.
Layer two mines the codebase directly. Three sub-signals combine into a single health score.
TODO/FIXME/HACK concentration. Count debt markers per thousand lines of code. A service at 0.5 TODOs per KLOC is maintained. A service at 8 per KLOC is accumulating shortcuts nobody comes back to. Weight markers that reference a specific issue number or date more heavily — those are known problems with deferred resolution, not vague intent.
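A rough sketch of the density calculation follows. The marker list comes from the text; the patterns used to detect an issue number or date, and the 2x weight for those "specific" markers, are illustrative assumptions to tune for your own conventions.

```python
import re

# Debt markers named in the text.
MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b")
# Assumed patterns for "specific" markers: #123, JIRA-style keys, ISO dates.
SPECIFIC = re.compile(r"#\d+|\b[A-Z]{2,}-\d+\b|\b\d{4}-\d{2}-\d{2}\b")


def todo_density(source_lines: list[str], specific_weight: float = 2.0) -> float:
    """Weighted debt markers per thousand lines of code (KLOC)."""
    if not source_lines:
        return 0.0
    weighted = 0.0
    for line in source_lines:
        if MARKER.search(line):
            # Markers tied to an issue or date count more than vague intent.
            weighted += specific_weight if SPECIFIC.search(line) else 1.0
    return weighted * 1000 / len(source_lines)
```

Counting every line in the denominator, including blanks and comments, is a simplification; a real run would more likely use the line counts your static-analysis tool already reports.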
Test coverage gap. Overall coverage is a blunt instrument. Measure coverage on files modified in the last 90 days instead. A service with 85% overall coverage but 20% coverage on recently-modified files is more fragile than one with 60% overall coverage that is consistently tested where it changes. The delta between overall and active-file coverage is the signal.
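The delta can be computed from any per-file coverage report. The sketch below assumes a hypothetical input shape, a mapping of file path to `(covered_lines, total_lines)`, plus a set of paths modified in the last 90 days; both would come from your coverage tool and version control history.

```python
from typing import Optional


def coverage_pct(files: dict[str, tuple[int, int]]) -> float:
    """Line coverage in percent; files maps path -> (covered, total)."""
    covered = sum(c for c, _ in files.values())
    total = sum(t for _, t in files.values())
    return 100.0 * covered / total if total else 0.0


def coverage_gap(
    files: dict[str, tuple[int, int]], recently_modified: set[str]
) -> dict[str, Optional[float]]:
    """Overall coverage, active-file coverage, and the delta between them."""
    overall = coverage_pct(files)
    active_files = {p: v for p, v in files.items() if p in recently_modified}
    if not active_files:
        # Nothing changed recently, so there is no active-file signal to report.
        return {"overall": overall, "active": None, "delta": 0.0}
    active = coverage_pct(active_files)
    return {"overall": overall, "active": active, "delta": overall - active}
```

A large positive delta means the code that is actually changing is tested far less than the headline number suggests.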
Dependency staleness. For each service, calculate the average age of direct dependencies versus the latest available versions. The mean time between a CVE's public disclosure and its first observed exploitation in the wild dropped from 745 days in 2020 to roughly 44 days by 2025[6]. A service pinned 18 months behind on security-critical dependencies is not a future problem — it is a liability with a shrinking response window. Weight dependencies with known CVEs in the gap at 3×.
These three signals can be automated from existing tooling. SonarQube or custom grep scripts for TODO density, Istanbul or coverage.py for the coverage delta, Dependabot or Snyk for staleness and CVE exposure.
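For the staleness sub-signal, one possible reading of "average age versus latest available versions" is the gap in days between the release date of the pinned version and the release date of the latest version. The sketch below uses that reading and applies the 3x CVE weight as a multiplier on the gap. The `Dependency` record is hypothetical; the dates and CVE flag would come from your dependency scanner.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Dependency:
    # Hypothetical record; populate from your dependency scanner's report.
    name: str
    pinned_released: date      # release date of the version you run
    latest_released: date      # release date of the newest available version
    cve_in_gap: bool = False   # known CVE fixed between pinned and latest


def staleness_days(deps: list[Dependency], cve_multiplier: float = 3.0) -> float:
    """Mean release-date lag in days, CVE-affected dependencies counted at 3x."""
    if not deps:
        return 0.0
    total = 0.0
    for dep in deps:
        lag = max((dep.latest_released - dep.pinned_released).days, 0)
        total += lag * (cve_multiplier if dep.cve_in_gap else 1.0)
    return total / len(deps)
```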
Architecture decisions accumulate as bets. Unrevisited bets become landmines.
Architecture Decision Records are useful when maintained[2]. They become dangerous when abandoned. Layer three parses the ADR repo and flags two failure classes.
Unresolved revisit markers. ADRs use phrases like "revisit after migration completes," "temporary until we evaluate alternatives in Q3," or "accepted risk — reassess in 6 months." The agent scans for those temporal markers and checks whether any follow-up ADR or PR addressed them. An ADR from 18 months ago that says "revisit in Q2 2025" with no follow-up is an architectural bet that may no longer be valid — and that nobody owns.
Superseded-but-not-updated ADRs. When a newer ADR partially contradicts an older one without explicitly superseding it, teams run on conflicting assumptions. The agent detects overlapping decision scopes and flags pairs where the older record was never marked superseded.
Every unresolved ADR tagged to a service increments that service's fragility score. Weight scales with age: a 6-month-old "revisit" is mild, a 2-year-old one is a landmine. Drift in decisions costs more than drift in code, because the decisions shape every change downstream.
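A minimal sketch of the revisit-marker scan and the age scaling is shown here. The phrase list is drawn from the examples above and is far from exhaustive. The linear ramp to full weight at 24 months is an assumed shape, since the text only says a 6-month-old marker is mild and a 2-year-old one is a landmine. Deciding whether a follow-up exists is left to the caller.

```python
import re
from datetime import date

# Assumed marker phrases, based on the examples quoted in the text.
REVISIT = re.compile(
    r"\b(revisit|reassess|re-evaluate|temporary until)\b", re.IGNORECASE
)


def revisit_markers(adr_text: str) -> list[str]:
    """Lines of an ADR that contain a revisit-style temporal marker."""
    return [ln.strip() for ln in adr_text.splitlines() if REVISIT.search(ln)]


def age_weight(decided_on: date, today: date, full_weight_months: int = 24) -> float:
    """Linear ramp from 0 to 1 over full_weight_months (assumed shape)."""
    months = (today.year - decided_on.year) * 12 + (today.month - decided_on.month)
    return min(max(months, 0) / full_weight_months, 1.0)


def adr_penalty(adr_text: str, decided_on: date, today: date, has_follow_up: bool) -> float:
    """Age-weighted count of unresolved revisit markers in one ADR."""
    if has_follow_up:
        return 0.0  # a follow-up ADR or PR addressed the marker
    return len(revisit_markers(adr_text)) * age_weight(decided_on, today)
```

Summing `adr_penalty` over every ADR tagged to a service gives a raw layer score that can then be normalized alongside the other layers.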
AWS Prescriptive Guidance documents that the dominant cost of ADR neglect is coordination overhead: teams realigning on the same decisions repeatedly because the source record was never closed out[3]. That tax compounds across sprints.
Layer four is the most subtle. It analyzes how reviews happen — not what was reviewed, but the dynamics of the review process. Three sub-signals reveal where the team is uncertain, confused, or working past the limits of shared understanding.
Review cycle count. PRs that go through more than five review cycles before merging signal misalignment — on requirements, on approach, on the team's mental model of the system. High cycle counts per service directory correlate with future defects because the team is not working from a shared model. They are negotiating one in the comments. The DORA 2025 report introduced rework rate as an official fifth metric, with benchmarks showing that more than 26% of teams spend 8-16% of deployment capacity on unplanned fixes[7] — the downstream cost of exactly this misalignment.
Uncertainty language in reviews. Scan review comments for hedging: "I think this is right but…", "not sure about this approach", "this might break", "we should probably", "let's revisit", "I don't fully understand." Concentration of hedging in a specific service is a leading indicator of knowledge gaps that produce bugs three sprints later.
Time-to-first-review. PRs that sit unreviewed for days in specific service directories indicate nobody feels confident reviewing that code. That ownership vacuum is itself a fragility signal. When the one person who understands the service goes on vacation, changes accumulate without meaningful review and the next regression is already in flight.
| Healthy review signals | Fragile review signals |
|---|---|
| 1–2 review cycles per PR on average | 5+ review cycles, request-changes loops that don't resolve |
| First review within 4 hours | First review delayed 2+ days |
| Confident language: 'LGTM', 'clean approach' | Hedging language: 'I think this works…', 'not sure about this' |
| Multiple qualified reviewers available | Single reviewer bottleneck, others decline or shallow-review |
| Consistent review depth across team members | Review depth varies wildly by who reviews |
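The three review-friction sub-signals can be summarized per service directory with something like the sketch below. The `PullRequest` record is a hypothetical shape for data pulled from your Git host's API. The hedge phrases and the thresholds (more than five cycles, first review delayed two or more days) come from the text, and plain substring matching is a deliberately crude stand-in for whatever text analysis you actually use.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hedging phrases quoted in the text, lowercased for matching.
HEDGES = [
    "i think this is right but",
    "not sure about this",
    "this might break",
    "we should probably",
    "let's revisit",
    "i don't fully understand",
]


@dataclass
class PullRequest:
    # Hypothetical record; fill from your Git host's review metadata.
    review_cycles: int
    hours_to_first_review: float
    comments: list[str] = field(default_factory=list)


def hedging_rate(prs: list[PullRequest]) -> float:
    """Share of review comments containing at least one hedging phrase."""
    comments = [c.lower().replace("\u2019", "'") for pr in prs for c in pr.comments]
    if not comments:
        return 0.0
    hedged = sum(any(h in c for h in HEDGES) for c in comments)
    return hedged / len(comments)


def friction_summary(prs: list[PullRequest]) -> dict[str, float]:
    """Per-service summary of the three review-friction sub-signals."""
    if not prs:
        return {"mean_cycles": 0.0, "share_over_five_cycles": 0.0,
                "share_first_review_2d_plus": 0.0, "hedging_rate": 0.0}
    n = len(prs)
    return {
        "mean_cycles": mean(pr.review_cycles for pr in prs),
        "share_over_five_cycles": sum(pr.review_cycles > 5 for pr in prs) / n,
        "share_first_review_2d_plus": sum(pr.hours_to_first_review >= 48 for pr in prs) / n,
        "hedging_rate": hedging_rate(prs),
    }
```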
The four layers combine into a single composite fragility score per service. Default weights are derived from the empirical finding that incident history is the strongest single predictor of future incidents[1], while PR friction is a leading indicator that fires earlier in the lifecycle.
The formula is straightforward: multiply each layer's normalized 0–100 score by its weight and sum. That is the composite. Services score between 0 and 100.
| Layer | Default Weight | Rationale | When to Increase |
|---|---|---|---|
| Incident History | 35% | Strongest empirical predictor of future breakage | High-traffic services where every P1 has business impact |
| Codebase Health | 25% | Captures accumulated neglect; slow-moving but reliable | Services with low test maturity or no observability |
| PR Review Friction | 25% | Leading indicator — fires 1–2 sprints before defects surface | Teams with high turnover or single-owner bottlenecks |
| ADR Audit | 15% | Strategic risk from unresolved decisions | Services undergoing active architectural evolution |
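The weighted sum with the default weights from the table looks like this in Python. The layer key names are illustrative, and the sketch assumes each layer score has already been normalized to 0–100 upstream.

```python
# Default weights from the table above; key names are illustrative.
DEFAULT_WEIGHTS = {
    "incident_history": 0.35,
    "codebase_health": 0.25,
    "pr_friction": 0.25,
    "adr_audit": 0.15,
}


def composite_score(layer_scores: dict[str, float],
                    weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of normalized 0-100 layer scores; result is also 0-100."""
    if set(layer_scores) != set(weights):
        raise ValueError("layer_scores and weights must cover the same layers")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    for layer, score in layer_scores.items():
        if not 0.0 <= score <= 100.0:
            raise ValueError(f"{layer} score {score} is outside 0-100")
    return sum(weights[layer] * layer_scores[layer] for layer in weights)


# A quiet service: no incidents, but neglected code and painful reviews.
quiet_adapter = {"incident_history": 0, "codebase_health": 80,
                 "pr_friction": 90, "adr_audit": 40}
print(round(composite_score(quiet_adapter), 1))  # 48.5
```

The example shows why a service with zero incidents can still land mid-table: 65% of the default weight sits on the three layers that do not depend on anything having broken yet.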
The output of the monthly agent is a fragility register — a ranked table of every service with composite score and per-layer breakdown. Not a dashboard on a TV nobody watches. A document delivered to engineering leadership, structured into three zones:
The register includes a trend indicator for each service — whether fragility increased, decreased, or stayed flat versus last month. A service that moved from 35 to 52 in one month deserves more attention than one that has been stable at 55 for six months. Trend is the leading indicator. Absolute score is the lagging one.
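The trend indicator can be derived from two consecutive monthly snapshots. In the sketch below, the 2-point band that counts as "flat" is an assumption, as is sorting the register view by upward movement rather than by absolute score.

```python
from typing import Optional


def trend(previous: float, current: float, flat_band: float = 2.0) -> str:
    """Label month-over-month movement; flat_band is an assumed tolerance."""
    delta = current - previous
    if delta > flat_band:
        return "increased"
    if delta < -flat_band:
        return "decreased"
    return "flat"


def rank_by_movement(
    last_month: dict[str, float], this_month: dict[str, float]
) -> list[tuple[str, float, Optional[float], str]]:
    """Rows of (service, score, delta, trend), largest upward movement first."""
    rows = []
    for service, current in this_month.items():
        previous = last_month.get(service)
        if previous is None:
            rows.append((service, current, None, "new"))
        else:
            rows.append((service, current, current - previous, trend(previous, current)))
    # Services with no prior snapshot sort last.
    rows.sort(key=lambda r: (r[2] is None, -(r[2] or 0.0)))
    return rows
```

With the figures from the paragraph above, the service that moved from 35 to 52 sorts ahead of the one holding steady at 55.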
McKinsey research estimates that technical debt represents 20–40% of the total value of a technology estate[8]. That figure does not move engineering teams. A specific list of services ranked by fragility, delivered monthly to the engineering leadership sync, does.
PagerDuty / Opsgenie / Rootly — incident history with service tagging
GitHub / GitLab — PR review metadata, cycle counts, comment text
SonarQube / custom scripts — TODO/FIXME counts, test coverage per directory
Dependabot / Snyk / Renovate — dependency staleness and CVE exposure
ADR repository — decision records with temporal markers
Adding features to fragile services compounds instability. Stabilize first, ship second.
Trend direction matters more than absolute score. Catch degradation while it is still cheap to reverse.
A report nobody reads provides zero value. Build it into an existing cadence or it dies.
Exceptions should be deliberate and documented. Quietly normalized exceptions are how Red Zones become permanent.
Gaming the score is its own signal. Log cosmetic cleanups separately — they tell you which teams feel metric pressure without owning the underlying problem.
Default weights are starting assumptions, not ground truth. After 90 days of monthly snapshots, you have enough data to calibrate against your own incident history.
Run a Spearman rank correlation between each service's fragility score at month one and its actual incident count in months two and three. If the correlation is weak for a specific layer — say, ADR audit shows no relationship to actual incidents in your org — reduce its weight. If PR friction is a near-perfect predictor, push it higher.
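A dependency-free sketch of that calibration check follows: Spearman's coefficient computed as the Pearson correlation of average ranks, which handles ties. If SciPy is available, `scipy.stats.spearmanr` does the same job; the hand-rolled version is here only to keep the example self-contained. The sample numbers are made up for illustration.

```python
def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (vx * vy) ** 0.5


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    return pearson(average_ranks(xs), average_ranks(ys))


# Illustrative data: month-one layer scores vs. incidents in months two and three.
month_one_scores = [72.0, 55.0, 40.0, 20.0]
later_incidents = [5, 3, 1, 0]
rho = spearman(month_one_scores, later_incidents)
```

Run it once per layer, using that layer's score instead of the composite, to see which layers track your real incidents and which do not.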
The practical test before you trust forward-looking scores: retroactively score your services against the last 12 months of incidents. Check whether high-scoring services had higher incident rates. Most teams see meaningful correlation by month four. If your correlation is weak at month six, the weights are wrong for your organization — tune them.
We got this wrong initially by weighting all four layers equally. The ADR audit penalized teams with strong documentation discipline — the teams writing the most detailed records also had the most "revisit" markers to flag. Reweighting ADR at 15% and incident history at 35% improved correlation between fragility score and actual incidents immediately. The failure mode was using the same weights for both mature and legacy services; mature services with good ADR hygiene need a different default profile than legacy services with no documentation at all.
How do you handle services with no incident history?
No incidents does not mean no risk. It often means insufficient monitoring. For services with zero incident history, increase the weight of codebase health and PR friction signals by 1.5x. Flag services with no alerts configured as a separate monitoring-gap category. Absence of evidence and evidence of absence are not the same thing in this score. Treat zero-incident services with high layer 2-4 scores as 'fragile and unobserved' — your highest-risk category.
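One way to apply the 1.5x boost without pushing the composite above 100 is to scale the chosen layers and then renormalize so the weights still sum to 1. The renormalization step is an assumption; the text specifies only the multiplier. Layer key names are illustrative.

```python
def boosted_weights(weights: dict[str, float],
                    boost_layers: set[str],
                    factor: float = 1.5) -> dict[str, float]:
    """Scale selected layer weights by factor, then renormalize to sum to 1."""
    raw = {layer: w * (factor if layer in boost_layers else 1.0)
           for layer, w in weights.items()}
    total = sum(raw.values())
    return {layer: w / total for layer, w in raw.items()}


defaults = {"incident_history": 0.35, "codebase_health": 0.25,
            "pr_friction": 0.25, "adr_audit": 0.15}
# Boost the two layers named in the answer above for a zero-incident service.
zero_incident = boosted_weights(defaults, {"codebase_health", "pr_friction"})
```

With the default weights this yields roughly 28% incident history, 30% codebase health, 30% PR friction, and 12% ADR audit.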
What if teams game the TODO count by removing markers without fixing anything?
Cross-reference TODO removal commits with actual code changes. If a commit only deletes comment markers without modifying the surrounding code, flag it as cosmetic cleanup and track it as a separate integrity signal. Gaming the score is its own diagnostic — it tells you exactly which teams feel pressured by the metric and which ones own the underlying problem. Log cosmetic cleanup events in the register alongside the scores.
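A heuristic sketch of the cosmetic-cleanup check, run against the unified diff of a single commit in `git diff` format: it flags a commit when a debt marker was removed and every changed line is a comment or blank. The comment-prefix pattern covers only a few common styles and is an assumption; adapt it to the languages in your repos.

```python
import re

MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b")
# Assumed comment prefixes: '#', '//', '/*', and '*' continuation lines.
COMMENT = re.compile(r"^\s*(#|//|/\*|\*)")


def _comment_or_blank(line: str) -> bool:
    return not line.strip() or bool(COMMENT.match(line))


def is_cosmetic_cleanup(diff_text: str) -> bool:
    """True if the diff removes debt markers while touching only comments."""
    removed: list[str] = []
    added: list[str] = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff "):
            in_hunk = False          # new file section: headers follow
        elif line.startswith("@@"):
            in_hunk = True           # hunk body starts after this line
        elif in_hunk and line.startswith("-"):
            removed.append(line[1:])
        elif in_hunk and line.startswith("+"):
            added.append(line[1:])
    if not any(MARKER.search(ln) for ln in removed):
        return False                 # no debt marker was removed at all
    return all(_comment_or_blank(ln) for ln in removed + added)
```

A flagged commit is a prompt for a conversation, not proof of gaming; deleting a stale marker whose issue was fixed elsewhere would trip the same check.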
How long until the heat map becomes predictive?
Three months of monthly snapshots establish meaningful trends and give you enough data for a first correlation analysis between fragility scores and subsequent incidents. That is when you start tuning the weights against your own data. Most teams see meaningful correlation by month four; if it is still weak at month six, the weights are wrong for your organization. The practical test: retroactively score services against the last 12 months of incidents and check whether services that scored high actually had higher incident rates. If correlation is weak, tune the weights before you trust the forward-looking scores.
Should every service be scored, or only the ones you're worried about?
Score everything. The services you are worried about are already on your radar. The heat map's value is finding the ones nobody is worried about — the quiet adapters, the inherited microservices, the 4,000-line billing shim that last had a commit eight months ago. Those are precisely the services that will surprise you. Selective scoring defeats the purpose.
What is the right cadence — monthly, weekly, or continuous?
Monthly for the fragility register. Weekly or continuous for the incident layer only. Incident signals change fast and should trigger real-time alerts when a service crosses a severity threshold. The other three layers change slowly — codebase health and ADR drift are monthly-scale phenomena. Running the full composite weekly creates noise without adding signal. Set incident layer thresholds to alert immediately; run the full register monthly.
The heat map works because it quantifies fragility that is otherwise invisible. Senior engineers already carry the list of brittle services in their heads. The heat map gives that intuition a score, a trend line, and a rule of action, and puts it in front of decision-makers who otherwise would not see it until a P1.
Start with incident history and codebase health. Those two layers require the least integration and produce the strongest signal. Add ADR audit and PR friction in month two, once you have baseline data. Within a quarter, the model is tuned to your actual failure patterns instead of defaults written for someone else's stack.
The alternative — waiting for services to announce their own fragility through outages — has a known cost. Stripe's research put developer time lost to technical debt at 42% of capacity[8]. The heat map does not eliminate that cost. It makes the investment deliberate instead of reactive.