Four signal layers — incident history, codebase health, ADR audit gaps, PR review friction — scored monthly per service. The output is a ranked fragility register that names your next outage weeks before it happens. The riskiest service is almost never the largest one. It is usually the one nobody has touched.
Most engineering orgs find their fragile services through a P1. The knowledge already existed — scattered across incident channels, abandoned ADRs, the instincts of senior engineers who have been burned before. Nobody synthesizes it. The heat map is the synthesis.
First time we ran this scoring on a mid-size platform, the top-fragility service was a 4,000-line payments adapter nobody had touched in eight months. Zero recent incidents. Looked clean. The signals said otherwise: 9 TODOs per KLOC pointing at a deprecated API, 18% coverage on recently-modified files, PR review cycles averaging 7 rounds. Three months later it caused a two-hour outage during peak traffic. The heat map had flagged it as critical for six weeks before that.
Fragility Is the Signal. Size Is a Distractor.
Large codebases are not inherently fragile. Small services with accumulated neglect are.
Engineering teams confuse size with risk. The 200,000-line monolith gets the attention. Meanwhile a 3,000-line billing service with zero tests, four unresolved TODOs pointing at a deprecated API, and a dependency three majors behind quietly accumulates blast radius.
Meta's Diff Risk Score research, published August 2025[1], showed predictive models trained on historical incident correlation outperform models trained on code complexity metrics. Lines of code do not predict breakage. Cyclomatic complexity does not predict breakage. Patterns of neglect, uncertainty, and unresolved decisions predict breakage.
The heat map captures exactly those signals. It asks one question: which services have the highest concentration of deferred maintenance, unreviewed architectural bets, and review-cycle friction? Those are the services that will surprise you next.
Layer 1: Past Breakage Predicts Future Breakage
Incident history is the strongest single predictor. Weight it accordingly.
Pull 90 days of incident data from your incident-management platform: PagerDuty, Opsgenie, Rootly, or FireHydrant. Aggregate by affected service. Raw counts mislead. A service with ten P4 informational alerts is not more fragile than one with two P1 outages.
Weight the score by severity and recency. Base points per incident: P1 is 10, P2 is 5, P3 is 2, P4 is 1. Halve every signal for each 30 days of age. An incident from 45 days ago counts half as much as one from last week, and one from 75 days ago counts a quarter. The result is a recency-weighted severity score that captures both the magnitude and the trajectory.
This layer also tags incident adjacency. When Service A goes down because Service B failed upstream, both services accumulate score — but with different tags: "origin" versus "blast-radius victim." That distinction is how you decide where to invest remediation. Hardening a victim does not stop the origin from breaking again.
| Severity | Base Score | 0-30 Days | 30-60 Days | 60-90 Days |
|---|---|---|---|---|
| P1 — Full outage | 10 | 10.0 | 5.0 | 2.5 |
| P2 — Degraded service | 5 | 5.0 | 2.5 | 1.25 |
| P3 — Minor impact | 2 | 2.0 | 1.0 | 0.5 |
| P4 — Informational | 1 | 1.0 | 0.5 | 0.25 |
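The table translates directly into a small scoring function. The sketch below is one way to do it, assuming incidents have already been exported with a service name, a severity, and an age in days; the `Incident` shape and the function names are illustrative, not part of any vendor API.

```typescript
type Severity = "P1" | "P2" | "P3" | "P4";

// Hypothetical shape for an exported incident; adapt to your platform's export.
interface Incident {
  service: string;
  severity: Severity;
  ageDays: number; // days between the incident and the scoring date
}

// Base points from the table above.
const BASE_POINTS: Record<Severity, number> = { P1: 10, P2: 5, P3: 2, P4: 1 };
const WINDOW_DAYS = 90;

// Points for one incident: base score halved for each full 30 days of age.
function incidentPoints(incident: Incident): number {
  if (incident.ageDays < 0 || incident.ageDays >= WINDOW_DAYS) return 0;
  const bucket = Math.floor(incident.ageDays / 30); // 0, 1, or 2
  return BASE_POINTS[incident.severity] * Math.pow(0.5, bucket);
}

// Sum recency-weighted points per affected service.
function scoreIncidentsByService(incidents: Incident[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const incident of incidents) {
    const current = totals.get(incident.service) ?? 0;
    totals.set(incident.service, current + incidentPoints(incident));
  }
  return totals;
}
```

Under these rules ten recent P4 alerts total 10 points while two recent P1 outages total 20, which matches the point that raw counts mislead.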
Layer 2: Read the Code for Signs of Neglect
Static signals extracted from the repo. TODOs, coverage gaps, dependency drift.
Layer two mines the codebase directly. Three sub-signals combine into a single health score.
TODO/FIXME/HACK concentration. Count debt markers per thousand lines of code. A service at 0.5 TODOs per KLOC is maintained. A service at 8 per KLOC is accumulating shortcuts nobody comes back to. Weight markers that reference a specific issue or date more heavily — those are known problems with deferred resolution, not vague intent.
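A minimal sketch of the marker count, assuming the source text of a service is already loaded into a string. The patterns for "references a specific issue or date" and the 2x weight are assumptions for illustration; the text only says such markers should weigh more.

```typescript
const DEBT_MARKER = /\b(TODO|FIXME|HACK)\b/;

// Assumed heuristics for a specific reference: "#123", a tracker key such as
// "PAY-123", or an ISO date such as "2025-03-01".
const SPECIFIC_REFERENCE = /#\d+|\b[A-Z][A-Z0-9]+-\d+\b|\b\d{4}-\d{2}-\d{2}\b/;
const SPECIFIC_WEIGHT = 2; // assumption: no exact multiplier is given

// Weighted debt markers per thousand non-blank lines.
function todoConcentration(source: string): number {
  const lines = source.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return 0;
  let weighted = 0;
  for (const line of lines) {
    if (!DEBT_MARKER.test(line)) continue;
    weighted += SPECIFIC_REFERENCE.test(line) ? SPECIFIC_WEIGHT : 1;
  }
  return (weighted * 1000) / lines.length;
}
```

Counting non-blank lines rather than raw lines is a design choice; any consistent line count works as long as it is the same across services.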
Test coverage gap. Overall coverage is a blunt instrument. Measure coverage on files modified in the last 90 days instead. A service with 85% overall coverage but 20% coverage on recently-modified files is more fragile than one with 60% overall coverage that is consistently tested where it changes. The delta between overall and active-file coverage is the signal.
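The delta can be computed from any per-file coverage report joined with last-modified dates. The sketch below assumes that join has already happened; the `FileCoverage` shape is hypothetical.

```typescript
// Hypothetical per-file record: coverage report joined with git history.
interface FileCoverage {
  path: string;
  coveredLines: number;
  totalLines: number;
  daysSinceModified: number;
}

interface CoverageGap {
  overall: number; // percent
  active: number; // percent, recently modified files only
  gap: number; // overall - active, in percentage points
}

// Line coverage as a percentage; 0 when there is nothing to measure.
function coveragePercent(files: FileCoverage[]): number {
  const total = files.reduce((sum, f) => sum + f.totalLines, 0);
  if (total === 0) return 0;
  const covered = files.reduce((sum, f) => sum + f.coveredLines, 0);
  return (covered * 100) / total;
}

function coverageGap(files: FileCoverage[], windowDays = 90): CoverageGap {
  const activeFiles = files.filter((f) => f.daysSinceModified <= windowDays);
  const overall = coveragePercent(files);
  // With no recently modified files there is no active-file signal, so report no gap.
  const active = activeFiles.length === 0 ? overall : coveragePercent(activeFiles);
  return { overall, active, gap: overall - active };
}
```

A negative gap means the recently changed files are better tested than the rest of the service, which is a healthy sign rather than a fragility signal.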
Dependency staleness. For each service, calculate the average age of direct dependencies versus the latest available versions. A service pinned 18 months behind, especially on security-critical dependencies, carries compounding risk. Weight dependencies with known CVEs in the gap at 3x. Drift here is not a future problem — it is an exploited problem waiting for someone to notice.
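One possible reading of the 3x rule is a weighted average in which dependencies with known CVEs in the gap count three times. The sketch below takes that interpretation; the `DependencyDrift` shape is hypothetical and would be filled from Dependabot or Renovate output.

```typescript
// Hypothetical per-dependency record derived from update-tool output.
interface DependencyDrift {
  name: string;
  monthsBehindLatest: number;
  knownCvesInGap: number;
}

const CVE_WEIGHT = 3; // from the text: CVE-affected dependencies weigh 3x

// Average months behind latest, with CVE-affected dependencies weighted 3x.
function weightedStalenessMonths(deps: DependencyDrift[]): number {
  let weightedMonths = 0;
  let totalWeight = 0;
  for (const dep of deps) {
    const weight = dep.knownCvesInGap > 0 ? CVE_WEIGHT : 1;
    weightedMonths += dep.monthsBehindLatest * weight;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 0 : weightedMonths / totalWeight;
}
```

If the CVE count is also fed into the score as its own signal, CVE exposure is counted twice. That may be intended, but it is worth deciding deliberately.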
```typescript
// codebaseHealthScorer.ts
// Per-service health score (0-100). Weights derived empirically; recalibrate
// after 90 days of data against your own incident history.
interface CodebaseHealthSignals {
  todoConcentration: number; // TODOs per KLOC
  testCoverageGap: number; // overall% - active_file%, in percentage points
  staleDependencies: number; // avg months behind latest
  cveExposure: number; // count of CVEs in dep gap
}

// Scale a raw signal to 0..1 against its saturation point, clamping both ends.
const saturate = (value: number, cap: number): number =>
  Math.min(Math.max(value / cap, 0), 1);

function scoreCodebaseHealth(signals: CodebaseHealthSignals): number {
  const todoScore = saturate(signals.todoConcentration, 10) * 25;
  // A negative gap (active files better covered than the rest) scores 0.
  const coverageScore = saturate(signals.testCoverageGap, 50) * 30;
  const staleScore = saturate(signals.staleDependencies, 24) * 20;
  const cveScore = saturate(signals.cveExposure, 5) * 25;
  return Math.round(todoScore + coverageScore + staleScore + cveScore);
}
```
Layer 3: The ADR That Said "Revisit in Q3" and Nobody Did
Architecture decisions accumulate as bets. Unrevisited bets become landmines.
Architecture Decision Records are powerful when maintained[2]. They become dangerous when abandoned. Layer three parses the ADR repo and flags two failure classes.
Unresolved revisit markers. ADRs use phrases like "revisit after migration completes," "temporary until we evaluate alternatives in Q3," or "accepted risk — reassess in 6 months." The agent scans for those temporal markers and checks whether any follow-up ADR or PR addressed them. An ADR from 18 months ago that says "revisit in Q2 2025" with no follow-up is an architectural bet that may no longer be valid — and that nobody owns.
Superseded-but-not-updated ADRs. When a newer ADR partially contradicts an older one without explicitly superseding it, teams run on conflicting assumptions. The agent detects overlapping decision scopes and flags pairs where the older record was never marked superseded.
Every unresolved ADR tagged to a service increments that service's fragility score. Weight scales with age. A 6-month-old "revisit" is mild. A 2-year-old one is a landmine. Drift in decisions costs more than drift in code, because the decisions shape every change downstream.
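The revisit-marker scan can be approximated with a few patterns. This is a sketch under stated assumptions: the `AdrRecord` shape is hypothetical, follow-up detection is assumed to happen upstream, and the age curve (one point per six months, capped at two years) is an illustrative choice, since the text only says weight scales with age.

```typescript
// Hypothetical parsed ADR; hasFollowUp is assumed to be resolved elsewhere
// by checking for a later ADR or PR that addresses the marker.
interface AdrRecord {
  id: string;
  service: string;
  text: string;
  ageMonths: number;
  hasFollowUp: boolean;
}

// Patterns based on the example phrases above; extend for your own ADR style.
const REVISIT_MARKERS: RegExp[] = [/\brevisit\b/i, /\breassess\b/i, /\btemporary until\b/i];

function hasRevisitMarker(text: string): boolean {
  return REVISIT_MARKERS.some((pattern) => pattern.test(text));
}

// Assumed age curve: 1 point per 6 months of age, capped at 4 points.
function adrFragilityPoints(adr: AdrRecord): number {
  if (adr.hasFollowUp || !hasRevisitMarker(adr.text)) return 0;
  return Math.min(Math.max(adr.ageMonths, 0) / 6, 4);
}

function adrScoreByService(adrs: AdrRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const adr of adrs) {
    totals.set(adr.service, (totals.get(adr.service) ?? 0) + adrFragilityPoints(adr));
  }
  return totals;
}
```

Detecting superseded-but-not-updated pairs needs scope comparison between records and is not covered by this sketch.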
Layer 4: PR Review Friction Is Knowledge Gap, Visible
Layer four is the most subtle. It analyzes how reviews happen — not what was reviewed, but the dynamics of the review process. Three sub-signals reveal where the team is uncertain, confused, or working past the limits of shared understanding.
Review cycle count. PRs that go through more than five review cycles before merging are signaling misalignment — on requirements, on approach, on the team's mental model of the system. High cycle counts per service directory correlate with future defects because the team is not working from a shared model. They are negotiating one in the comments.
Uncertainty language in reviews. Scan review comments for hedging: "I think this is right but…", "not sure about this approach", "this might break", "we should probably", "let's revisit", "I don't fully understand." Concentration of hedging in a specific service is a leading indicator of knowledge gaps that produce bugs three sprints later.
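A simple phrase match is enough to get a first reading of hedging concentration. The sketch below assumes review comments have already been fetched and grouped by service; the phrase list is the one from the paragraph above and would need tuning to a team's own vocabulary.

```typescript
// Hedging phrases taken from the examples above, lowercased for matching.
const HEDGING_PHRASES: string[] = [
  "i think this is right but",
  "not sure about this approach",
  "this might break",
  "we should probably",
  "let's revisit",
  "i don't fully understand",
];

// Lowercase and normalize curly apostrophes so "don’t" matches "don't".
function normalizeComment(text: string): string {
  return text.toLowerCase().replace(/[\u2018\u2019]/g, "'");
}

// Fraction of review comments that contain at least one hedging phrase.
function hedgingRate(comments: string[]): number {
  if (comments.length === 0) return 0;
  const hedged = comments.filter((comment) => {
    const normalized = normalizeComment(comment);
    return HEDGING_PHRASES.some((phrase) => normalized.includes(phrase));
  }).length;
  return hedged / comments.length;
}
```

Comparing the rate across service directories matters more than the absolute value, because baseline hedging varies by team and reviewer.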
Time-to-first-review. PRs that sit unreviewed for days in specific service directories indicate nobody feels confident reviewing that code. That ownership vacuum is itself a fragility signal. When the one person who understands the service goes on vacation, changes accumulate without meaningful review and the next regression is already in flight.
| Healthy review signals | Fragile review signals |
|---|---|
| 1-2 review cycles per PR on average | 5+ review cycles, request-changes loops that won't resolve |
| First review within 4 hours | First review delayed 2+ days |
| Confident language: 'LGTM', 'clean approach' | Hedging language: 'I think this works…', 'not sure' |
| Multiple qualified reviewers available | Single reviewer bottleneck, others decline |
| Consistent review depth across team members | Review depth varies wildly by who reviews |
The Fragility Register: A Document, Not a Dashboard
The output of the monthly agent is a fragility register — a ranked table of every service with composite score and per-layer breakdown. Not a dashboard on a TV nobody watches. A document delivered to engineering leadership, structured into three zones:
- Red Zone (70-100): Services that need attention now. Schedule remediation sprints or reduce deploy frequency until the score drops.
- Yellow Zone (40-69): Services accumulating risk. Land in next quarter's debt budget. Assign an owner for the highest-contributing signal layer.
- Green Zone (0-39): Services operating inside acceptable risk. Watch the trend.
The register includes a trend indicator for each service — whether fragility increased, decreased, or stayed flat versus last month. A service that moved from 35 to 52 in one month deserves more attention than one that has been stable at 55 for six months. Trend is the leading indicator. Absolute score is the lagging one.
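The zones and trend indicator reduce to a few comparisons. A minimal sketch, assuming composite scores on a 0-100 scale and last month's score available per service; the two-point tolerance for "flat" is an assumption, not something the register defines.

```typescript
type Zone = "red" | "yellow" | "green";
type Trend = "up" | "down" | "flat";

interface ServiceScore {
  service: string;
  score: number; // this month's composite, 0-100
  previousScore: number; // last month's composite, 0-100
}

interface RegisterRow extends ServiceScore {
  zone: Zone;
  trend: Trend;
}

function zoneFor(score: number): Zone {
  if (score >= 70) return "red";
  if (score >= 40) return "yellow";
  return "green";
}

// Assumption: moves of 2 points or less in either direction count as flat.
function trendFor(current: number, previous: number, tolerance = 2): Trend {
  const delta = current - previous;
  if (delta > tolerance) return "up";
  if (delta < -tolerance) return "down";
  return "flat";
}

// Ranked register: highest fragility first.
function buildRegister(scores: ServiceScore[]): RegisterRow[] {
  return scores
    .map((s) => ({ ...s, zone: zoneFor(s.score), trend: trendFor(s.score, s.previousScore) }))
    .sort((a, b) => b.score - a.score);
}
```

Sorting by score alone puts the stable 55 above the service that jumped from 35 to 52, so a reviewer still has to read the trend column.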
Data Sources Required
- PagerDuty / Opsgenie / Rootly: incident history with service tagging
- GitHub / GitLab: PR review metadata, cycle counts, comment text
- SonarQube / custom scripts: TODO/FIXME counts, test coverage per directory
- Dependabot / Renovate: dependency staleness and CVE exposure
- ADR repository: decision records with temporal markers
Signal Layer Weights — Starting Point, Not Gospel
- Incident History: 35%. Strongest direct predictor of future breakage.
- Codebase Health: 25%. Captures accumulated neglect and maintenance debt.
- PR Review Friction: 25%. Leading indicator of knowledge gaps.
- ADR Audit: 15%. Captures strategic risk from unresolved decisions.
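Combining the layers is a weighted average. The sketch below assumes each layer score has already been normalized to 0-100; that normalization is an assumption, since raw incident points and ADR points are unbounded and how to scale them is left open here.

```typescript
// One score per signal layer, each assumed to be normalized to 0-100.
interface LayerValues {
  incidentHistory: number;
  codebaseHealth: number;
  prReviewFriction: number;
  adrAudit: number;
}

// Starting weights from the list above, as integer percentages.
const DEFAULT_WEIGHTS: LayerValues = {
  incidentHistory: 35,
  codebaseHealth: 25,
  prReviewFriction: 25,
  adrAudit: 15,
};

// Weighted average of the layer scores, rounded to a whole number.
function compositeFragility(scores: LayerValues, weights: LayerValues = DEFAULT_WEIGHTS): number {
  const totalWeight =
    weights.incidentHistory + weights.codebaseHealth + weights.prReviewFriction + weights.adrAudit;
  if (totalWeight <= 0) throw new Error("weights must sum to a positive number");
  const weighted =
    scores.incidentHistory * weights.incidentHistory +
    scores.codebaseHealth * weights.codebaseHealth +
    scores.prReviewFriction * weights.prReviewFriction +
    scores.adrAudit * weights.adrAudit;
  return Math.round(weighted / totalWeight);
}
```

Dividing by the weight total means recalibrated weights do not have to sum to exactly 100.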
Operating Rules for the Register
- Red Zone services do not accept new feature work until the fragility score drops below 70. Adding features to fragile services compounds instability. Stabilize first, ship second.
- Every service that moves from Green to Yellow gets an owner assigned within 5 business days. Trend direction matters more than absolute score. Catch degradation while it is still cheap to reverse.
- The register is reviewed in the monthly engineering leadership sync, and that review is not optional. A report nobody reads provides zero value. Build it into an existing cadence or it dies.
- Override requests for Red Zone feature work require VP-level approval with a written remediation plan. Exceptions should be deliberate and documented. Quietly normalized exceptions are how Red Zones become permanent.
How do you handle services with no incident history?
No incidents does not mean no risk. It often means insufficient monitoring. For services with zero incident history, increase the weight of codebase health and PR friction signals by 1.5x. Flag services with no alerts configured as a separate monitoring-gap category. Absence of evidence and evidence of absence are not the same thing in this score.
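One way to apply the 1.5x rule is to boost the two weights and then renormalize so the total stays at 100. This is an interpretation, not the only reading; the type and function names are illustrative, and the starting weights are the ones given in this article.

```typescript
interface LayerWeights {
  incidentHistory: number;
  codebaseHealth: number;
  prReviewFriction: number;
  adrAudit: number;
}

const BASE_WEIGHTS: LayerWeights = {
  incidentHistory: 35,
  codebaseHealth: 25,
  prReviewFriction: 25,
  adrAudit: 15,
};

const NO_HISTORY_BOOST = 1.5; // from the answer above

// Weights for a service: boosted and renormalized to 100 when it has no incident history.
function weightsFor(hasIncidentHistory: boolean, base: LayerWeights = BASE_WEIGHTS): LayerWeights {
  if (hasIncidentHistory) return { ...base };
  const boosted: LayerWeights = {
    ...base,
    codebaseHealth: base.codebaseHealth * NO_HISTORY_BOOST,
    prReviewFriction: base.prReviewFriction * NO_HISTORY_BOOST,
  };
  const total =
    boosted.incidentHistory + boosted.codebaseHealth + boosted.prReviewFriction + boosted.adrAudit;
  return {
    incidentHistory: (boosted.incidentHistory * 100) / total,
    codebaseHealth: (boosted.codebaseHealth * 100) / total,
    prReviewFriction: (boosted.prReviewFriction * 100) / total,
    adrAudit: (boosted.adrAudit * 100) / total,
  };
}
```

With the default weights this yields 28 / 30 / 30 / 12. The incident layer still holds 28% of the weight against a score of zero, so such a service cannot exceed 72; dropping the incident weight entirely for these services is an alternative worth considering.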
What if teams game the TODO count by removing markers without fixing anything?
Cross-reference TODO removal commits with actual code changes. If a commit only deletes comment markers without modifying the surrounding code, flag it as cosmetic cleanup and track it as a separate integrity signal. Gaming the score is its own diagnostic — it tells you exactly which teams feel pressured by the metric and which ones own the underlying problem.
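A rough version of that cross-check can run over the unified diff of each commit. The sketch below is a heuristic under stated assumptions: it only recognizes `//`, `#`, `/*`, and `*` comment prefixes, and it treats any changed non-comment line as a real code change.

```typescript
const REMOVED_MARKER = /\b(TODO|FIXME|HACK)\b/;
// Assumed comment prefixes; extend for other languages in your repos.
const COMMENT_ONLY = /^\s*(\/\/|#|\/\*|\*)/;

// True when a diff removes at least one debt marker and touches nothing but comments.
function isCosmeticTodoRemoval(unifiedDiff: string): boolean {
  const removed: string[] = [];
  const added: string[] = [];
  for (const line of unifiedDiff.split(/\r?\n/)) {
    // Skip file headers. Limitation: a removed line whose own text begins
    // with "--" also looks like a header and is skipped.
    if (line.startsWith("---") || line.startsWith("+++")) continue;
    if (line.startsWith("-")) removed.push(line.slice(1));
    else if (line.startsWith("+")) added.push(line.slice(1));
  }
  const removesMarker = removed.some((l) => REMOVED_MARKER.test(l));
  const onlyComments = [...removed, ...added].every(
    (l) => l.trim() === "" || COMMENT_ONLY.test(l),
  );
  return removesMarker && onlyComments;
}
```

Markers removed from the end of a code line show up as a changed code line and are not flagged by this check, so it undercounts rather than overcounts.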
How long until the heat map becomes predictive?
Three months of monthly snapshots establish meaningful trends. After six months, run correlation analysis between fragility scores and subsequent incidents — that is when you tune the weights against your own data, not the defaults. Most teams see strong predictive correlation by month four. The practical test: retroactively score services against the last 12 months of incidents and check whether services that scored high actually had higher incident rates. If the correlation is weak, tune the weights before you trust the forward-looking scores.
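For the retroactive test, a rank correlation is a reasonable first check because it only asks whether higher-scored services had more incidents, not whether the relationship is linear. The sketch below uses Spearman's rank correlation; that choice is an assumption, since no specific method is prescribed here.

```typescript
// 1-based ranks, with tied values sharing the average of their positions.
function averageRanks(values: number[]): number[] {
  const order = values
    .map((value, index) => ({ value, index }))
    .sort((a, b) => a.value - b.value);
  const ranks = new Array<number>(values.length).fill(0);
  let i = 0;
  while (i < order.length) {
    let j = i;
    while (j + 1 < order.length && order[j + 1].value === order[i].value) j++;
    const rank = (i + j) / 2 + 1;
    for (let k = i; k <= j; k++) ranks[order[k].index] = rank;
    i = j + 1;
  }
  return ranks;
}

function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const meanX = x.reduce((sum, v) => sum + v, 0) / n;
  const meanY = y.reduce((sum, v) => sum + v, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = x[i] - meanX;
    const dy = y[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  const denom = Math.sqrt(varX * varY);
  return denom === 0 ? 0 : cov / denom; // 0 when either series is constant
}

// Spearman correlation between fragility scores and later incident counts,
// with one entry per service in the same order in both arrays.
function spearman(fragilityScores: number[], incidentCounts: number[]): number {
  if (fragilityScores.length !== incidentCounts.length || fragilityScores.length < 2) {
    throw new Error("need two equally sized arrays with at least two services");
  }
  return pearson(averageRanks(fragilityScores), averageRanks(incidentCounts));
}
```

A value near 1 means the ranking held up against history. A value near 0 is the cue to tune the weights before relying on the forward-looking scores.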
The heat map works because it makes invisible fragility visible and quantified. Teams already know which services are brittle — every senior engineer carries the list in their head. The heat map gives that intuition a number, a trend line, and a rule of action. Start with incident history and codebase health — the layers that require the least integration. Add ADR audit and PR friction in month two, once you have baseline data. Within a quarter, the model is tuned to your actual failure patterns instead of someone else's.
One thing we got wrong initially: we weighted all four layers equally. The ADR audit ended up dragging down teams with good ADR hygiene — the teams writing the most detailed records also had the most "revisit" markers to flag. We reweighted ADR at 15% and incident history at 35% after three months. Correlation between fragility score and actual incidents improved immediately. The recommended weights are a starting point. The signal each layer carries depends on your team's documentation discipline, your alerting maturity, and the specific shape of your last twelve months. Calibrate before you trust.
- [1] Diff Risk Score (DRS): AI-Aware Software Development. Meta Engineering (engineering.fb.com)
- [2] Architecture Decision Records (ADR) Process. AWS Prescriptive Guidance (docs.aws.amazon.com)
- [3] Master Architecture Decision Records: Best Practices for Effective Decision Making. AWS (aws.amazon.com)
- [4] The Modern Risk Prioritization Framework for 2026. Safe Security (safe.security)