240 billion lines of active COBOL, by IBM's count. Five billion new lines added per year. The substrate is growing, not retiring[2]
71% of Fortune 500 companies use mainframes, and IBM Z is the dominant platform. Banking, insurance, government: outages here become headlines[3]
Median COBOL developer age in the US is 55. 10% of that workforce retires every year. 60% of orgs name the shortage as their top modernization constraint[5]
Kyndryl survey. The systems being replaced were already absorbing 60–80% of the IT budget[11]
Replacing the old system is the easy part. Reading it is what kills the project.
A payroll engine that clears $200M a year carries thirty years of special cases. A deduction branch added for one customer in 2008. Rounding behavior that downstream reports quietly depend on. A PERFORM call tree nobody has drawn end-to-end since the Bush administration. One engineer still holds the map in his head. He retires in eight months. After that, the org is flying blind into a rewrite it already committed to the board.
The substrate is growing, not retiring. IBM puts active COBOL at 240 billion lines, with five billion added every year[2]. Median COBOL developer age in the US is 55[5]. That is not an industry warning. It is a countdown on institutional knowledge nobody bothered to write in a form a system could read.
Most AI-coding commentary assumes greenfield: clean requirements, modern tooling, a team that can grep its own repo. Legacy comprehension is a structurally different problem. The specs disagree with the code. The documentation describes the 1994 design intent. The original engineers retired, left, or died. The rules that move actual money live inside programs written in a language 85% of universities stopped teaching by 1995[5]. There is no source of truth except the running binary and the source it compiled from.
AI comprehension agents, set up correctly, can extract that logic with higher fidelity than a six-month consulting engagement. The phrase "set up correctly" is doing the load-bearing work in that sentence. The rest of this piece is what correct looks like: five extraction patterns that bind every claim to a source line and reject anything that does not.
Reading Is a Different Problem From Rewriting
Most failed modernizations are comprehension failures wearing rewrite costumes. The fight: implementation labor versus extraction discipline.
Two different problems get collapsed into one, then mishandled together.
A rewrite problem assumes you know what the system does and asks you to rebuild it in a modern language. A comprehension problem assumes the requirements were never written down in any form that survived, the running code is the only source of truth, and the requirements have to be reverse-engineered before anything else is possible. One assumes the answer. The other admits there is no answer yet.
Documentation drift widens the gap further. Every enterprise legacy system was documented at some point. That documentation is now wrong in ways nobody has marked. The comments describe what a developer intended to write in 1998 — not what the code does after fifteen years of patches. The README describes the interface before the 2011 refactor. The wiki flowcharts describe a processing flow superseded by a 2017 hotfix nobody bothered to write up. The rules have to be inferred from the code itself, against no reliable external reference.
This is not a generation problem. It is an extraction problem with an evidence requirement. Every business rule that gets surfaced has to trace back to specific lines in specific files. Anything else is a guess. Confident-sounding fiction is exactly what corrupts the knowledge base on the way to production. The distinction is the whole game.
Five Extraction Patterns That Hold Under Audit
Every pattern binds claims to source lines. Patterns that skip the binding produce confident hallucinations indistinguishable from rules.
What separates extraction that ships from extraction that produces plausible-sounding garbage is one discipline: every claim traces to a source line. An uncited rule is a hallucination. It might be a correct hallucination — but with no way to verify, you cannot trust it, which means you cannot ship against it. Each of the five patterns below enforces grounding differently. None of them skip it.
| Pattern | What it produces | Grounding signal | When it fails |
|---|---|---|---|
| Function-level summarization | Plain-language summary of a single paragraph or procedure with its called dependencies inlined | Summary cites the paragraph name and line range; copybook expansions called out explicitly | Whole program dumped into context. The model hallucinates behavior for sections it cannot attend to closely |
| Business rule extraction with citation | Numbered rules, each carrying a file:line citation that proves its source | Every rule includes source file, line range, and a verbatim code quote that supports it | Rules inferred across multiple files without explicit source mapping. Cross-file inference breaks the binding |
| Data lineage mapping | Field-level graph: what writes each field, what reads it, what transforms it across programs and copybooks | Every edge cites the program and line where the assignment or transformation occurs | Copybooks not resolved before analysis. MOVE statements become opaque without the copybook context |
| Test generation before rewrite | Behavioral test suite that pins observed input/output behavior of the legacy system | Cases are validated green against the live legacy system before any rewrite work begins | Test data is synthetic instead of drawn from production traces. The edge cases that production already found get missed |
| Interface synthesis for strangler fig | Modern API contract (OpenAPI or equivalent) derived from the legacy program's entry points and data structures | Every API field maps back to a copybook field or working-storage entry with the original name and type preserved | Interface inferred from documentation rather than code. Documentation drift produces a spec the legacy program does not satisfy |
Pattern 1: Summarize One Paragraph at a Time
The context window is not your friend when the context is 40,000 lines of COBOL. The model hallucinates the parts it cannot attend to.
The mistake every team makes on their first run: the context dump. They feed the whole COBOL source file — or worse, several files — into the model and ask "what does this do?" The model answers confidently. The answer is wrong in specific, critical places. The model hallucinated behavior for the sections it could not attend to closely, and nobody knows which parts are fabricated.
Function-level summarization inverts the topology. Extract a single COBOL paragraph, PERFORM block, or RPG subroutine — the smallest coherent unit of behavior. Retrieve and inline the copybooks it references and the immediate callees it invokes. The model sees exactly enough context to summarize that one unit, with its full dependency chain visible, and nothing else.
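A minimal sketch of the slicing step, assuming fixed-format source where a paragraph header starts in column 8 and is followed only by a period. The names `split_paragraphs`, `perform_targets`, and `SAMPLE` are hypothetical, and the input is assumed to be the PROCEDURE DIVISION text, so line numbers are relative to it. A real pipeline would likely use a proper COBOL parser rather than regular expressions.

```python
import re
from dataclasses import dataclass

# Heuristic, assuming fixed-format source: a paragraph header is a name that
# starts in column 8 (Area A) and is followed only by a period.
PARAGRAPH_HEADER = re.compile(r"^.{6} ([A-Z0-9][A-Z0-9-]*)\.\s*$")
# Inline PERFORM forms (PERFORM UNTIL ...) start with a keyword, not a target.
INLINE_KEYWORDS = {"UNTIL", "VARYING", "WITH", "TEST"}


@dataclass
class Paragraph:
    name: str
    start: int  # 1-based line number of the header
    end: int    # 1-based line number of the paragraph's last line
    text: str

    def citation(self) -> str:
        return f"{self.name}, lines {self.start}-{self.end}"


def split_paragraphs(procedure_division: str) -> list[Paragraph]:
    """Split PROCEDURE DIVISION text into paragraphs with line ranges."""
    lines = procedure_division.splitlines()
    headers = [
        (number, match.group(1))
        for number, line in enumerate(lines, start=1)
        if (match := PARAGRAPH_HEADER.match(line))
    ]
    paragraphs = []
    for index, (start, name) in enumerate(headers):
        # A paragraph runs until the line before the next header.
        end = headers[index + 1][0] - 1 if index + 1 < len(headers) else len(lines)
        paragraphs.append(Paragraph(name, start, end, "\n".join(lines[start - 1:end])))
    return paragraphs


def perform_targets(text: str) -> list[str]:
    """Names after PERFORM, to be listed as out-of-scope callees."""
    found = re.findall(r"\bPERFORM\s+([A-Z0-9-]+)", text)
    return sorted({name for name in found if name not in INLINE_KEYWORDS})


SAMPLE = """\
       CALC-DEDUCTIONS.
           PERFORM CALC-FED-TAX
           PERFORM CALC-STATE-TAX.
       CALC-FED-TAX.
           MULTIPLY WS-GROSS BY WS-FED-RATE GIVING WS-FED-TAX.
"""

for paragraph in split_paragraphs(SAMPLE):
    print(paragraph.citation(), "callees:", perform_targets(paragraph.text))
```

Each `Paragraph` carries its own name and line range, so the citation the prompt demands can be checked against what was actually sent to the model.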
The load-bearing discipline: the prompt forces the model to cite the paragraph name, the line range, and every copybook it relied on. A summary without citation cannot be verified, and an unverified summary is indistinguishable from a hallucination. The moment citation becomes a hard requirement, the prompt starts surfacing which summaries the model is confident about and which it is guessing. The guesses stop dressing up as facts.[9]
paragraph-extraction-prompt.txt

You are a COBOL code analyst. Summarize the behavior of the paragraph below.

Rules:
1. Open with the paragraph name and line range (e.g., CALC-DEDUCTIONS, lines 847-923).
2. List every copybook referenced and the specific fields used from each.
3. List every PERFORM or CALL target. Do not summarize them. Note them as out-of-scope callees.
4. Describe what this paragraph does in plain English: inputs, transformations, outputs.
5. If a branch cannot be determined without more context, say so explicitly.
   Do not infer or assume. Mark it as requiring additional context.
6. Do not summarize behavior from callees. Only this paragraph.

Paragraph:
[PARAGRAPH_NAME], lines [START]-[END]
[COBOL SOURCE]

Referenced copybook expansions:
[COPYBOOK_CONTENT]

Pattern 2: Every Rule Carries a Source Address
Rules without file:line citations are not rules. They are guesses dressed up as policy.
Business rule extraction is the highest-upside, highest-failure-rate pattern of the five. The upside: a structured, queryable catalog of what the system actually does, written in plain language a business reviewer can sign off on. The failure mode when done naively: a catalog full of plausible-sounding rules that mix what the code does with what the analyst assumed it does.
The grounding discipline is non-negotiable. Every extracted rule includes the source file name, the line range, and a verbatim snippet of the code that supports the claim. A rule that reads "Premium calculations apply a 12% loading for customers in the high-risk tier" is useless without PREMIUM-CALC.CBL:1203-1241 and the relevant MULTIPLY statement. With the citation, a business analyst validates it in five minutes. Without it, the validation never happens and the rule quietly corrupts the knowledge base.[10]
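The mechanical half of that validation can be automated before a human ever sees the rule. The sketch below assumes a hypothetical `Rule` record and a hypothetical three-line source file. It only checks that the quoted code appears inside the cited line range. Whether the plain-language rule describes that code correctly is still the reviewer's call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    text: str   # plain-language statement for the business reviewer
    file: str   # cited source file
    start: int  # first cited line, 1-based
    end: int    # last cited line, inclusive
    quote: str  # verbatim code offered as evidence


def normalize(code: str) -> str:
    """Collapse whitespace and case so layout differences do not matter."""
    return " ".join(code.split()).upper()


def citation_holds(rule: Rule, sources: dict[str, list[str]]) -> bool:
    """True only if the quoted code really appears inside the cited lines."""
    lines = sources.get(rule.file)
    if lines is None or not rule.quote.strip():
        return False
    if not 1 <= rule.start <= rule.end <= len(lines):
        return False
    cited = normalize(" ".join(lines[rule.start - 1:rule.end]))
    return normalize(rule.quote) in cited


# Hypothetical three-line file standing in for a real program.
SOURCES = {
    "DEMO-BILLING.CBL": [
        "           IF WS-DAYS-OVERDUE > 30",
        "               MULTIPLY WS-BALANCE BY 0.015",
        "                   GIVING WS-PENALTY.",
    ]
}

grounded = Rule("Late-payment penalty of 1.5% after 30 days overdue",
                "DEMO-BILLING.CBL", 1, 3,
                "MULTIPLY WS-BALANCE BY 0.015 GIVING WS-PENALTY")
ungrounded = Rule("High-risk customers are flagged during onboarding",
                  "DEMO-BILLING.CBL", 1, 3, "MOVE 'H' TO CUST-RISK-TIER")

print(citation_holds(grounded, SOURCES), citation_holds(ungrounded, SOURCES))
```

Rules that fail this check never reach the analyst's queue, which keeps the five-minute validation budget for rules that at least point at real code.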
Run the extraction twice on the same source. If two runs produce different rules from the same code, the grounding is broken; the model is generating instead of extracting. Treat extraction as a deterministic function: same input, same output. Variance is a hallucination signal, not a quirk.
Ungrounded (reject):
- "The system applies a late-payment penalty of 1.5% after 30 days" (no source citation)
- "High-risk customers are flagged during onboarding" (inferred from a comment, not from code)
- "The deduction logic handles federal and state taxes" (summary of a comment block, actual branches never verified)
- "Payroll runs on Friday evenings" (copied from a README last updated in 2009)
- "The rate table is sourced from an external feed" (plausible inference, no CALL or READ statement cited)
Grounded (accept):
- "Late-payment penalty of 1.5% applied when WS-DAYS-OVERDUE > 30" (BILLING-CALC.CBL:445-451, MULTIPLY WS-BALANCE BY 0.015 GIVING WS-PENALTY)
- "High-risk flag set when CUST-RISK-SCORE > 750" (ONBOARD-VALIDATE.CBL:112-118, MOVE 'H' TO CUST-RISK-TIER)
- "Federal tax computed in CALC-FED-TAX (lines 847-923), state tax in CALC-STATE-TAX (lines 924-1005)" (separate paragraphs cited individually)
- "Payroll batch scheduled via JCL job PRJOB042; scheduling config lives outside the COBOL source and requires JCL analysis"
- "Rate table loaded via READ RATE-FILE at INIT-RATE-TABLE:33; external file dependency confirmed in the FD entry"
Pattern 3: Map the Field Graph Before Anyone Writes a Rewrite Ticket
Which field feeds which, across copybooks, CALL chains, and JCL job steps. Without it, the rewrite breaks invisible coupling.
Data lineage is the most technically demanding extraction pattern and the one with the clearest downstream payoff. Legacy systems rarely process data in a single program. They hand it through chains of COBOL programs, RPG procedures, JCL job steps, and shared copybooks. A field named CUST-BALANCE might be set in ACCT-LOAD.CBL, transformed in BALANCE-ADJ.CBL, fed through a copybook CUSTMAST.CPY used by twelve programs, and finally written to a report by RPT-STMT.CBL. No single developer holds that chain in their head.
A lineage agent builds the graph systematically. Start from a target field — the dollar amount that prints on the customer statement, say — and trace backward. What assigns this field? What reads the assigning program's output? Which copybook defines the shared structure? The agent emits a graph where every edge cites the program and line where the data moves. The output is not a summary. It is a queryable map. "Show me everything that writes CUST-BALANCE before the statement run" becomes answerable without a human tracing CALL statements by hand.[12]
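The backward trace can be sketched as a breadth-first walk over cited edges. The `Edge` record, the field names other than CUST-BALANCE, and all line numbers below are hypothetical; the program names come from the example chain above.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str   # field that is read
    target: str   # field that is written
    program: str  # program where the data moves
    line: int     # line of the assignment or transformation


def trace_back(field: str, edges: list[Edge]) -> list[Edge]:
    """Every edge that feeds `field`, directly or transitively."""
    writers = defaultdict(list)
    for edge in edges:
        writers[edge.target].append(edge)
    seen, queue, found = {field}, deque([field]), []
    while queue:
        current = queue.popleft()
        for edge in writers.get(current, []):
            found.append(edge)
            if edge.source not in seen:  # guards against cycles
                seen.add(edge.source)
                queue.append(edge.source)
    return found


# Hypothetical edges; a real agent would emit these with real citations.
EDGES = [
    Edge("IN-BALANCE", "CUST-BALANCE", "ACCT-LOAD.CBL", 120),
    Edge("ADJ-AMOUNT", "CUST-BALANCE", "BALANCE-ADJ.CBL", 88),
    Edge("CUST-BALANCE", "STMT-BALANCE", "RPT-STMT.CBL", 301),
]

for edge in trace_back("STMT-BALANCE", EDGES):
    print(f"{edge.target} <- {edge.source} at {edge.program}:{edge.line}")
```

Because every edge keeps its program and line, the answer to a lineage query is itself a list of citations rather than a narrative.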
The failure mode that destroys the graph: unresolved copybook expansion. COBOL uses copybooks as shared struct definitions, and the same field name can appear in multiple copybooks with different storage layouts. A lineage agent that traces field references without resolving copybook expansions will produce edges that claim a field is shared between programs when they are actually independent fields that happen to share a name. Resolve all copybooks before any lineage analysis runs. This is not optional.
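One cheap guard, sketched under assumptions: scan the copybooks for field names that appear in more than one member with different PICTURE clauses, and refuse to merge lineage edges across them. The `ARCHIVE.CPY` member and both layouts are hypothetical, and the regular expression ignores USAGE clauses such as COMP-3, which also change the storage layout.

```python
import re
from collections import defaultdict

# Matches simple elementary items such as: 05  CUST-BALANCE  PIC S9(7)V99.
FIELD_DEF = re.compile(r"\b\d{2}\s+([A-Z0-9-]+)\s+PIC\s+(\S+)")


def conflicting_fields(copybooks: dict[str, str]) -> dict[str, dict[str, str]]:
    """Field names defined in several copybooks with different pictures."""
    layouts = defaultdict(dict)
    for member, text in copybooks.items():
        for line in text.upper().splitlines():
            match = FIELD_DEF.search(line)
            if match:
                layouts[match.group(1)][member] = match.group(2).rstrip(".")
    return {
        field: by_member
        for field, by_member in layouts.items()
        if len(set(by_member.values())) > 1
    }


# Hypothetical copybook contents for illustration only.
COPYBOOKS = {
    "CUSTMAST.CPY": (
        "       05  CUST-ID       PIC X(10).\n"
        "       05  CUST-BALANCE  PIC S9(7)V99 COMP-3."
    ),
    "ARCHIVE.CPY": (
        "       05  CUST-ID       PIC X(10).\n"
        "       05  CUST-BALANCE  PIC 9(9)."
    ),
}

print(conflicting_fields(COPYBOOKS))
```

A field flagged here is a candidate for two independent graph nodes, keyed by copybook and name, rather than one shared node.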
Pattern 4: Generate the Behavioral Test Suite Before a Single Line of New Code
The pattern most rewrites skip. The teams that skip it discover their requirements during production incidents.
One pattern separates modernizations that ship from the ones that quietly fail for six months after go-live: generate a behavioral test suite from the legacy system before you write a single line of the new one.
The suite is not documentation. It is a regression harness derived from observing what the legacy system actually does with real inputs. Run the extraction agent across the codebase to identify the significant behavioral branches. Instrument the legacy system to capture input/output pairs for each branch — ideally from production traffic, or from a test data set built to cover the branches the agent identified. Generate test cases that assert the observed behavior. Then run those cases against the legacy system to confirm they are right. The suite goes green against the thing being replaced.
Now correctness has a definition that does not depend on anyone's memory or on documentation that may be wrong. When the new system passes the suite, the team has evidence — not faith — that it behaves equivalently. When it fails, the divergence is precise: a specific input and a specific output difference, not a vague "something seems off" from a user in UAT.[9]
Teams that skip this step discover their requirements during production incidents. The deduction branch that only fires for one customer processes $4M a year. It was in no spec. It was in no test the new system carried. It shows up three months after cutover in a payroll discrepancy.
- [01] Identify behavioral branches with the extraction agent
  Run function-level summarization across the most critical programs to build a branch inventory. Every conditional in the output is a candidate test case.
- [02] Build a test data set that covers the branches
  Synthetic data misses the edge cases that only exist in production. Use anonymized production traces wherever the data classification allows.
- [03] Instrument the legacy system to capture outputs for each test input
  Run the data set through the legacy system and record the outputs. These become the assertions.
- [04] Generate test cases and prove them green against the legacy system
  Write tests that assert the captured outputs for each input. Run them green against the legacy system before treating the suite as a regression harness.
- [05] Run the suite against the new system through development
  Every divergence is a decision point: bug in the new system, or intentional improvement? Document the decision either way.
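The record-then-compare loop in steps 03 to 05 can be sketched as follows. Both pay functions are hypothetical stand-ins: in practice the legacy side would be the instrumented mainframe job, not a Python function, and the 7.5% rate, the C-0042 special case, and the rounding modes are invented for illustration.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP
from typing import Callable

Case = dict
System = Callable[[Case], dict]


def record_golden(legacy: System, inputs: list[Case]) -> list[tuple[Case, dict]]:
    """Step 03: capture what the legacy system does for each input."""
    return [(case, legacy(case)) for case in inputs]


def find_divergences(candidate: System, golden: list[tuple[Case, dict]]) -> list[dict]:
    """Steps 04 and 05: report every input whose output differs."""
    report = []
    for case, expected in golden:
        actual = candidate(case)
        if actual != expected:
            report.append({"input": case, "expected": expected, "actual": actual})
    return report


def legacy_net_pay(case: Case) -> dict:
    gross = Decimal(case["gross"])
    deduction = gross * Decimal("0.075")
    if case["customer"] == "C-0042":  # the one-customer branch nobody wrote down
        deduction += Decimal("12.50")
    net = (gross - deduction).quantize(Decimal("0.01"), rounding=ROUND_DOWN)
    return {"net": str(net)}


def rewritten_net_pay(case: Case) -> dict:
    gross = Decimal(case["gross"])
    deduction = gross * Decimal("0.075")
    net = (gross - deduction).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {"net": str(net)}


INPUTS = [
    {"customer": "C-0001", "gross": "1000.00"},
    {"customer": "C-0042", "gross": "1000.00"},
    {"customer": "C-0001", "gross": "1000.30"},
]

golden = record_golden(legacy_net_pay, INPUTS)
print(find_divergences(legacy_net_pay, golden))     # green against legacy
print(find_divergences(rewritten_net_pay, golden))  # divergences to review
```

The second report names the special-case customer and the truncation-versus-rounding input precisely, which is the kind of divergence that otherwise surfaces as a payroll discrepancy after cutover.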
Pattern 5: Synthesize the Interface From the Code, Not the Documentation
The strangler fig router gets built against the contract the legacy program actually satisfies — not the one a wiki page claims it does.
The strangler fig pattern — routing new traffic to a modern service while the legacy system handles the rest — depends on a clean interface definition between the two. That interface cannot be guessed or invented. It has to match what the legacy program actually accepts and returns, or the routing layer becomes a translation layer that accumulates its own bugs.
Interface synthesis extracts the API contract directly from the legacy program's entry points: the LINKAGE SECTION in COBOL, the parameter lists in RPG procedures, the method signatures in legacy Java. The agent maps every parameter to its copybook definition or class field, preserves original data types and lengths, and emits an OpenAPI specification (or equivalent) that drives the modern service's contract. Every field in the spec traces back to a named field in the legacy source. No invented names. No guessed types. The strangler fig router is built against this spec, not against an assumption.[6]
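A rough sketch of the COBOL half of that mapping: read simple PIC fields from the LINKAGE SECTION and emit a JSON-Schema-style object of the kind an OpenAPI document embeds. The `x-cobol-*` extension keys, the choice to carry implied-decimal numbers as strings, and the sample fields are all assumptions made for illustration. Group items, OCCURS, REDEFINES, and USAGE clauses are not handled.

```python
import re

PIC_FIELD = re.compile(r"\b\d{2}\s+([A-Z0-9-]+)\s+PIC\s+(\S+)")


def expand_picture(pic: str) -> str:
    """Expand repeat counts: X(3) -> XXX, S9(2)V9 -> S99V9."""
    return re.sub(r"(.)\((\d+)\)", lambda m: m.group(1) * int(m.group(2)), pic)


def field_schema(name: str, pic: str) -> dict:
    shape = expand_picture(pic)
    schema = {"x-cobol-name": name, "x-cobol-pic": pic}
    symbols = set(shape)
    if symbols <= {"X", "A"}:
        schema.update({"type": "string", "maxLength": len(shape)})
    elif symbols <= {"S", "9"}:
        schema.update({"type": "integer"})
    elif symbols <= {"S", "9", "V"}:
        # Implied decimal point: carried as a string to avoid float rounding.
        scale = shape.split("V", 1)[1].count("9")
        schema.update({"type": "string", "x-cobol-scale": scale})
    else:
        schema.update({"type": "string", "x-needs-review": True})
    return schema


def linkage_to_schema(source: str) -> dict:
    """Build an object schema from PIC fields in the LINKAGE SECTION."""
    properties, in_linkage = {}, False
    for line in source.upper().splitlines():
        if "LINKAGE SECTION" in line:
            in_linkage = True
        elif "PROCEDURE DIVISION" in line:
            in_linkage = False
        elif in_linkage and (match := PIC_FIELD.search(line)):
            name, pic = match.group(1), match.group(2).rstrip(".")
            properties[name] = field_schema(name, pic)  # original name kept
    return {"type": "object", "properties": properties}


# Hypothetical entry point for illustration.
SAMPLE = """\
       LINKAGE SECTION.
       01  LS-REQUEST.
           05  LS-CUST-ID      PIC X(10).
           05  LS-TERM-MONTHS  PIC 9(3).
           05  LS-PREMIUM      PIC S9(7)V99.
       PROCEDURE DIVISION USING LS-REQUEST.
"""

print(linkage_to_schema(SAMPLE))
```

Anything the mapper does not recognize is flagged for review instead of being given a guessed type, which keeps the "no guessed types" rule enforceable.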
This is the shortest path from legacy comprehension to active migration. Once the interface spec is grounded, the new service can stand up and start absorbing a subset of traffic while the legacy system handles the rest — the coexistence playbook in practice.
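The coexistence itself reduces to a routing decision per operation. A minimal sketch, assuming operations are identified by name and that both backends accept the same grounded contract; the handler functions and operation names are hypothetical placeholders.

```python
from typing import Callable

Handler = Callable[[str, dict], dict]


def make_router(migrated: set[str], modern: Handler, legacy: Handler) -> Handler:
    """Send migrated operations to the modern service, the rest to legacy."""
    def route(operation: str, payload: dict) -> dict:
        handler = modern if operation in migrated else legacy
        return handler(operation, payload)
    return route


# Hypothetical stand-ins for the two backends.
def modern_service(operation: str, payload: dict) -> dict:
    return {"served_by": "modern", "operation": operation}


def legacy_gateway(operation: str, payload: dict) -> dict:
    return {"served_by": "legacy", "operation": operation}


route = make_router({"quote-premium"}, modern_service, legacy_gateway)
print(route("quote-premium", {"LS-CUST-ID": "0000000042"}))
print(route("renew-policy", {"LS-CUST-ID": "0000000042"}))
```

Growing the `migrated` set one operation at a time is the whole migration schedule; the router never translates payloads, because both sides speak the contract derived from the legacy source.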
The Tool Ceiling Matters Less Than the Process Floor
Tools differ in what they ground against and which stacks they understand. Pick based on the stack you operate, not the vendor pitch.
The tool market for legacy comprehension has expanded sharply in the past eighteen months. Every major cloud vendor has a product. IBM has a mainframe-specific offering. A new category of code intelligence platforms has shown up. Honest read: most of these tools help with the comprehension problem. None of them solve it without a grounding discipline imposed by the team. The tool is a retrieval and generation layer. The grounding is the team's responsibility.
The best-performing teams in this space were not using the most sophisticated tool. They were running a general-purpose model with a rigorous prompting discipline and a human validation step that rejected anything without a source citation. The tool ceiling matters less than the process floor. Buying Watsonx and abandoning the grounding discipline produces worse output than running Claude Code with strict citation requirements.
Tools worth evaluating in 2026
- ✓ Anthropic Claude Code: General-purpose, holds up well under structured prompts and grounded retrieval. No built-in COBOL indexer, so bring your own; the five extraction patterns above apply directly. The right pick for teams that want to own the comprehension pipeline.
- ✓ IBM Watsonx Code Assistant for Z: Mainframe-specific, trained on COBOL and the Z ecosystem. NOSI case study showed a 79% reduction in time to understand complex COBOL applications.[7] The value is real. The lock-in is real. Justified for large IBM Z shops, hard to justify outside them.
- ✓ Amazon Q Code Transformation: Java-focused. Strong for upgrading legacy Java 8/11 to Java 17/21, irrelevant for COBOL or RPG.[8] The right tool for modernizing old Java applications. The wrong tool for mainframe comprehension.
- ✓ GitHub Copilot: Solid on modern languages with grounding via workspace indexing. Reads COBOL syntax but lacks the mainframe-semantic training Watsonx carries. Useful as a supplement, not a primary tool for deep COBOL work.[12]
- ✓ Cursor: Strong IDE-based comprehension with codebase indexing. Works well for legacy Java where the IDE can parse the structure. COBOL support needs manual configuration. Best for interactive, file-by-file exploration.
- ✓ Code Comprehend: Newer entrant (beta opened December 2025), purpose-built for legacy comprehension across COBOL, Java, and .NET. Architecture visualization and dependency mapping ship built in. Worth evaluating for teams without the budget for Watsonx.
Five Anti-Patterns That Burn the Modernization Budget
Each one shows up in every failed legacy AI engagement. Name them now or pay six months later.
The Whole-Codebase Dump
Feeding 40,000 lines of COBOL into one context window and asking 'what does this do?' produces confident summaries with embedded hallucinations. The model attends unevenly across long contexts. Function-level extraction with retrieval is not a workaround. It is the only approach that produces verifiable output.
The Confident Translator
Asking the agent to translate COBOL to Java before extracting the business rules produces a Java program that implements what the LLM inferred the COBOL was doing. That inference is usually close to correct. It is not correct in exactly the cases where correct matters most.
Trust Without Verify
Running extraction and treating the output as ground truth without human review is the fastest known way to corrupt a knowledge base. Every extracted rule needs a subject-matter validator — ideally a domain expert, at minimum someone who can verify the citation against the source. The extraction agent produces candidates, not facts.
Skip the Test Data Set
Generating behavioral tests from synthetic data misses the edge cases production has been discovering for thirty years. If anonymized production traces are not accessible, branch coverage on the synthetic data has to be explicitly verified against the branch inventory the extraction agent produced.
Rewrite Before Read
Starting the rewrite before comprehension is complete forces developers to discover requirements during implementation — the most expensive place to discover them. A six-week comprehension program that delays a six-month rewrite by six weeks is almost always the right trade.
What the First 90 Days Look Like When You Actually Start
A working sequence for standing up a legacy comprehension program from a cold start. Phases gate on evidence, not calendar.
- [01] Scope and inventory the target system (Days 1–14)
  Before any agent touches the codebase, establish what you are operating against. This is archaeology before excavation.
- [02] Stand up the retrieval and indexing layer (Days 15–30)
  The comprehension agent is only as good as its retrieval layer. Invest here before writing a single extraction prompt.
- [03] Run extraction on the critical programs and validate (Days 31–60)
  Extract function summaries and business rules for the priority programs, with human validation at every step.
- [04] Generate and validate the behavioral test suite (Days 61–90)
  Use the validated knowledge base to drive test case generation. The 90 days end with a green regression suite against the legacy system, or they end early.
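For the retrieval layer in phase 02, the simplest thing that can be proven correct is an inverted index from field name to the paragraphs that mention it. The sketch below is an assumption about what a first version might look like, not a description of any particular tool; it treats any hyphenated identifier as a field, so hyphenated keywords such as END-IF would also be indexed. The paragraph names and bodies are hypothetical.

```python
import re
from collections import defaultdict

# Heuristic: COBOL data names are usually hyphenated (WS-BALANCE, CUST-ID).
IDENTIFIER = re.compile(r"[A-Z][A-Z0-9]*(?:-[A-Z0-9]+)+")


def build_field_index(paragraphs: dict[str, str]) -> dict[str, set[str]]:
    """Map each hyphenated identifier to the paragraphs that mention it."""
    index = defaultdict(set)
    for name, text in paragraphs.items():
        for token in IDENTIFIER.findall(text.upper()):
            index[token].add(name)
    return dict(index)


def lookup(index: dict[str, set[str]], field: str) -> set[str]:
    return index.get(field.upper(), set())


# Hypothetical paragraph bodies keyed by paragraph name.
PARAGRAPHS = {
    "CALC-PENALTY": "IF WS-DAYS-OVERDUE > 30 "
                    "MULTIPLY WS-BALANCE BY 0.015 GIVING WS-PENALTY.",
    "PRINT-STATEMENT": "MOVE WS-BALANCE TO RPT-BALANCE.",
}

index = build_field_index(PARAGRAPHS)
print(lookup(index, "ws-penalty"))
print(lookup(index, "WS-BALANCE"))
```

Querying a field whose location is already known, and confirming the right paragraph comes back, is a quick smoke test before any extraction prompt is written.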
Questions That Show Up Once a Team Starts Operating This
We don't have any COBOL developers left — can we even run this?
Yes. That is exactly the problem the comprehension agent is built for. The pipeline does not need a COBOL developer. It needs someone who can validate business rules against the source citations. That role can be a domain expert who understands the business logic — a senior actuary, a payroll specialist, a compliance officer — anyone who can look at a cited rule and say whether it describes correct behavior. The extraction agent handles the COBOL reading. Human validation handles the business correctness check. The two are separable, and that separation is the whole point.
What about RPG and CL on AS/400?
The five patterns apply directly. IBM's Watsonx Code Assistant for Z, after the 'Project Bob' consolidation announced in late 2025, covers both IBM Z COBOL and IBM i RPG in a unified product. For RPG, the copybook equivalent is the data structure definition (/COPY members), and the CALL chain equivalent is the procedure call tree. Both resolve the same way as COBOL inside the extraction pipeline. The grounding discipline is identical.
Is IBM Watsonx Code Assistant for Z worth the price?
For large IBM Z shops running significant COBOL volume, probably yes. The NOSI case study showed a 79% reduction in comprehension time, and the product carries mainframe-specific training that general-purpose models lack.[7] For smaller shops or organizations where the COBOL is a contained subsystem rather than the core platform, the licensing cost likely exceeds the benefit. The honest comparison is: what does a 6-month consulting engagement cost versus a year of Watsonx licensing plus the productivity gain? For large banks and government agencies running mainframes as primary infrastructure, the math favors Watsonx. For an insurance company with 200K lines of COBOL in one legacy module, it does not.
How do we handle copybooks and macros?
Resolve them before any extraction runs. Non-negotiable. A COBOL program that references a copybook is incomplete without it. The extraction agent needs the expanded form to trace field references and data movements correctly. Build a pre-processing step in the indexer that resolves all COPY statements and inline-expands the copybook content. Tag expanded sections with their source copybook name so citations stay traceable. The same principle covers JCL symbolic substitution and RPG /COPY directives.
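A sketch of that pre-processing step, assuming plain `COPY NAME.` statements on their own line with no sequence numbers in columns 1 to 6 and no REPLACING phrase. The copybook store is a hypothetical in-memory dictionary, and the BEGIN/END markers are written as fixed-format comment lines with an asterisk in column 7.

```python
import re

COPY_STMT = re.compile(r"^\s*COPY\s+([A-Z0-9-]+)\s*\.\s*$", re.IGNORECASE)


def expand_copybooks(source: str, copybooks: dict[str, str], _stack: tuple = ()) -> str:
    """Inline every COPY member and tag the expansion with its origin."""
    out = []
    for line in source.splitlines():
        match = COPY_STMT.match(line)
        if not match:
            out.append(line)
            continue
        name = match.group(1).upper()
        if name in _stack:
            raise ValueError(f"recursive COPY of {name}")
        if name not in copybooks:
            # Fail loudly: an unresolved copybook makes later analysis unsafe.
            raise KeyError(f"unresolved copybook {name}")
        out.append(f"      * BEGIN COPY {name}")
        nested = expand_copybooks(copybooks[name], copybooks, _stack + (name,))
        out.extend(nested.splitlines())
        out.append(f"      * END COPY {name}")
    return "\n".join(out)


# Hypothetical program and copybook for illustration.
COPYBOOKS = {
    "CUSTMAST": "       01  CUST-RECORD.\n"
                "           05  CUST-BALANCE  PIC S9(7)V99.",
}
PROGRAM = (
    "       WORKING-STORAGE SECTION.\n"
    "       COPY CUSTMAST.\n"
    "       PROCEDURE DIVISION."
)

print(expand_copybooks(PROGRAM, COPYBOOKS))
```

Line numbers in the expanded text no longer match the original file, so a real indexer would probably also keep a map from expanded lines back to their source member and line.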
When is the comprehension good enough to start the rewrite?
When the behavioral test suite is green against the legacy system and covers the critical branches the business rule catalog flagged. That is the objective criterion. The subjective version — 'we feel like we understand it well enough' — is exactly how teams end up discovering requirements in production. The test suite gives a falsifiable definition of good enough. If the new system passes the suite and every divergence is reviewed and approved, the comprehension was sufficient. Starting the rewrite before the suite goes green is a bet on your own comprehension — the exact problem the program was meant to solve.
Pre-Rewrite Comprehension Checklist
- Production source confirmed, not an archived snapshot that diverged from production
- All copybooks, /COPY members, and data structure definitions inventoried and indexed
- Call graph mapped, with entry points and utility programs identified, not assumed
- Five to ten most business-critical programs prioritized with stakeholder evidence, not headcount
- Retrieval indexer proven: querying a known field name returns the correct paragraph
- Extraction run twice on the same source and outputs match (determinism confirmed, not hoped for)
- Every business rule in the knowledge base carries a source file:line citation
- Every rule has been validated by a domain expert against its citation
- Data lineage map built for fields feeding critical financial calculations and state transitions
- Behavioral test cases generated from the validated business rule catalog and branch inventory
- Test suite green against the live legacy system before a single line of new code is written
- Interface specification for strangler fig migration derived from legacy entry points, not documentation
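The run-twice item on the checklist can be made mechanical. One way, sketched here as an assumption rather than a prescribed method, is to compare the two runs on their evidence (file, line range, quoted code) and ignore differences in plain-language wording. The rule dictionaries and the `DEMO-BILLING.CBL` file name are hypothetical.

```python
def evidence_key(rule: dict) -> tuple:
    """Identity of a rule for comparison: where it points, not how it reads."""
    quote = " ".join(rule["quote"].split()).upper()
    return (rule["file"], rule["start"], rule["end"], quote)


def compare_runs(run_a: list[dict], run_b: list[dict]) -> dict[str, list[tuple]]:
    """Evidence present in one extraction run but missing from the other."""
    a = {evidence_key(rule) for rule in run_a}
    b = {evidence_key(rule) for rule in run_b}
    return {"only_in_first": sorted(a - b), "only_in_second": sorted(b - a)}


# Hypothetical output of two extraction runs over the same source.
RUN_1 = [
    {"text": "Penalty of 1.5% after 30 days overdue",
     "file": "DEMO-BILLING.CBL", "start": 1, "end": 3,
     "quote": "MULTIPLY WS-BALANCE BY 0.015 GIVING WS-PENALTY"},
]
RUN_2 = [
    {"text": "A 1.5% late-payment penalty applies past 30 days",
     "file": "DEMO-BILLING.CBL", "start": 1, "end": 3,
     "quote": "multiply ws-balance by 0.015   giving ws-penalty"},
    {"text": "Penalty is waived for staff accounts",
     "file": "DEMO-BILLING.CBL", "start": 4, "end": 6,
     "quote": "IF CUST-TYPE = 'S'"},
]

print(compare_runs(RUN_1, RUN_1))
print(compare_runs(RUN_1, RUN_2))
```

An empty report means both runs agree on what the code proves. Any entry in either list is a rule to re-examine before it enters the knowledge base.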
Modernizations fail because the system gets rewritten before it gets understood. Confidence fills the comprehension gap — developer confidence, consultant confidence, project manager confidence that the legacy system is simpler than it looks. It is never simpler than it looks. Thirty years of business logic, edge cases, and undocumented behavior accumulated in that COBOL because the business kept changing and the system kept being patched to match. Read first. Map the rules. Generate the tests. Then decide whether to rewrite.
The extraction patterns here are not new research. They are disciplines careful teams have applied to legacy systems for years, now executable at speed with AI comprehension agents. Gartner's projection that 40% of legacy modernization projects will incorporate AI-assisted reverse engineering by 2026, up from under 10% in 2023, reflects a real shift in what is practical.[10] The tools are good enough. The constraint is the discipline to use them correctly. Ground every claim. Validate every rule. A fast extraction is not the same thing as a correct one, and the system being replaced cannot tell you the difference.
- [1] The Stack: There's Over 800 Billion Lines of COBOL in Daily Use (thestack.technology)
- [2] IBM: What Is COBOL Modernization? 240 Billion Lines Active, 5 Billion Written Annually (ibm.com)
- [3] Precisely: 9 Mainframe Statistics, 71% of Fortune 500 Companies Use Mainframes (precisely.com)
- [4] MetaIntro: The $3 Trillion Code Nobody Knows How to Fix, COBOL Developer Shortage 2026 (metaintro.com)
- [5] Integrative Systems: Why COBOL Programmers Are Still in Demand in 2026, Average Developer Age 55 (integrativesystems.com)
- [6] IBM: watsonx Code Assistant for Z, Mainframe COBOL Modernization (ibm.com)
- [7] IBM Research: Watsonx Code Assistant for Z, 79% Reduction in Time to Understand COBOL Applications (NOSI Case Study) (research.ibm.com)
- [8] AWS: Amazon Q Developer Transform, Java Legacy Modernization (aws.amazon.com)
- [9] HackerNoon: AI Agents vs. COBOL, How Legacy Mainframes Are Being Reverse-Engineered at Scale (hackernoon.com)
- [10] SoftwareSeni: Cutting Legacy Reverse Engineering Time by 66% with AI Code Comprehension (softwareseni.com)
- [11] EPAM: Mainframe Modernization ROI, A Cost-Focused Guide, Average Annualized Savings $23.3M (epam.com)
- [12] GitHub Blog: How GitHub Copilot and AI Agents Are Saving Legacy Systems (github.blog)
- [13] Fujitsu: Generative AI Service That Analyzes COBOL Source Code and Automatically Generates Design Documents (March 2026) (global.fujitsu)