A legacy codebase reading agent starts with a premise most AI content ignores: the hardest part of modernizing a legacy system isn't the rewrite. It's reading what's already there. A payroll engine that processes $200M a year has accumulated thirty years of special cases, branch logic for one customer, deduction rules that predate current tax law, and a PERFORM call tree that nobody has drawn out in full since 2003. One person understands it. He retires in eight months. Then the organization is flying blind.
Most AI-assisted development articles assume greenfield. You have clean requirements, modern tooling, and a team that can read every file. The legacy comprehension problem is structurally different. There are no specs that match the code. The documentation describes what the system was supposed to do in 1994. The original developers are gone, retired, or unavailable. The business rules — the actual logic driving actual money — live entirely inside programs written in a language 85% of universities stopped teaching in the 1990s.[5]
The good news: AI code-comprehension agents can extract business logic from these systems with better fidelity than a six-month consulting engagement, if set up correctly. The phrase "if set up correctly" is doing enormous work in that sentence. This piece is about what correct looks like — five extraction patterns that ground every claim in source code lines and stop confident hallucinations before they corrupt your knowledge base.
The Comprehension Gap Nobody Talks About
Why reading the legacy system is a different problem from rewriting it — and why most modernization failures start here
The rewrite conversation and the comprehension conversation are often treated as the same conversation. They aren't.
A rewrite problem is: given clear requirements, rebuild this functionality in a modern language. A comprehension problem is: we don't have clear requirements, the only source of truth is decades-old code, and we need to extract the requirements from the implementation before we can do anything else. One assumes you know what the system does. The other assumes you don't.
Documentation drift makes this worse than it sounds. Most enterprise legacy systems were documented at some point — and that documentation is now wrong in ways that aren't labeled. The comments describe what a developer intended to write in 1998, not what the code does after fifteen years of patches. The README describes the interface before the last major refactor, which happened in 2011. The flowcharts in the wiki describe a processing flow that was superseded by a hotfix in 2017 that nobody documented at all. The actual business rules have to be inferred from the code itself, with no reliable external reference to validate against.
This means the comprehension problem requires a fundamentally different approach than generating new code. It requires extraction with evidence: every business rule you surface needs to trace back to specific lines in specific files, or it's not a business rule — it's a guess about what the business rule might be. That distinction separates functional knowledge bases from confident-sounding fiction.
Five Extraction Patterns That Actually Work
Each pattern grounds claims in source lines. Patterns that don't ground produce confident hallucinations.
What separates extraction that works from extraction that produces plausible-sounding garbage is a single discipline: every claim traces to a source line. An uncited rule is a hallucination. It may be a correct hallucination, but you have no way to know that, which means you can't trust it. The five patterns below all enforce grounding differently, but they all enforce it.
| Pattern | What it produces | Grounding signal | When it fails |
|---|---|---|---|
| Function-level summarization | Plain-language summary of a procedure or paragraph with called dependencies inlined | Summary must cite the paragraph name and line range it describes; copybook expansions noted explicitly | When the whole program is dumped as context — LLM hallucinates behavior not in the specific function |
| Business rule extraction with citation | Numbered rules with file:line citations proving each rule's source | Every rule entry includes source file, line range, and a direct code quote that supports it | When rules are inferred from multiple files without explicit source mapping — cross-file inference breaks grounding |
| Data lineage mapping | Field-level graph: what produces each field, what consumes it, what transforms it across programs and copybooks | Each edge in the lineage graph cites the program and line where the assignment or transformation occurs | When copybook expansions are not resolved before analysis — MOVE statements become opaque without the copybook context |
| Test generation before rewrite | Behavioral test suite that captures observed input/output behavior of the legacy system | Test cases are validated by running them against the live legacy system before any rewrite begins | When test data is synthetic rather than derived from production traces — edge cases are missed |
| Interface synthesis for strangler fig | Modern API specification (OpenAPI or equivalent) derived from the legacy program's entry points and data structures | Each API field maps back to a copybook field or working-storage entry with its original name and type | When the interface is inferred from documentation rather than code — documentation drift produces incorrect specs |
Pattern 1: Function-Level Summarization (Not Whole-Codebase Dumps)
The context window is not your friend when the context is 40,000 lines of COBOL
The most common mistake teams make when they first try AI comprehension on a legacy codebase is the context dump. They feed the entire COBOL source file — or worse, multiple files — into the model and ask "what does this do?" The model answers confidently. The answer is wrong in specific, critical ways. The model hallucinated behavior for the section it couldn't attend to closely at the end of the context window, and nobody knows which parts are fabricated.
Function-level summarization works differently. You extract a single COBOL paragraph, PERFORM block, or RPG subroutine — the smallest coherent unit of behavior. You then retrieve and inline the copybooks it references and the immediate callees it invokes. The model sees exactly enough context to summarize what this specific unit does, with its full dependency chain visible, and nothing else.
The critical discipline: the prompt must require the model to cite the paragraph name, the line range, and any copybook it relied on in its summary. A summary without that citation cannot be verified, and an unverified summary is indistinguishable from a hallucination. The moment you establish citation as a hard requirement in the prompt, you discover which summaries the model is confident about and which ones it's guessing. The guesses stop masquerading as facts.[9]
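The mechanical side of this pattern — pull one paragraph out of the source, inline its copybooks, and assemble the grounded prompt — can be sketched in a few lines of Python. This is a deliberately simplified sketch, not a production parser: it assumes fixed-format COBOL with paragraph labels starting in area A (column 8), and every function name here is illustrative.

```python
import re

def extract_paragraph(source_lines, para_name):
    """Return (start, end, text) for a COBOL paragraph, 1-indexed.

    Simplified assumption: a paragraph label sits in area A (7 leading
    spaces) and ends with a period; the paragraph runs until the next
    such label. Real fixed-format COBOL needs sequence-area and
    continuation handling on top of this.
    """
    label = re.compile(r"^ {7}([A-Z0-9-]+)\.\s*$")
    start = end = None
    for i, line in enumerate(source_lines, start=1):
        m = label.match(line)
        if m and m.group(1) == para_name:
            start = i
        elif m and start is not None:
            end = i - 1
            break
    if start is None:
        raise ValueError(f"paragraph {para_name} not found")
    end = end or len(source_lines)
    return start, end, "\n".join(source_lines[start - 1:end])

def build_prompt(para_name, start, end, body, copybooks):
    """Assemble the function-level context: one paragraph, its line
    range, and its copybook expansions -- nothing else."""
    expansions = "\n".join(
        f"* {name} *\n{text}" for name, text in copybooks.items()
    )
    return (
        f"Paragraph:\n{para_name}, lines {start}-{end}\n{body}\n"
        f"Referenced copybook expansions:\n{expansions}"
    )
```

The point of keeping extraction this narrow is that the citation requirement becomes checkable: the line range the model is told to cite is the line range you gave it.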
Extraction prompt template:

You are a COBOL code analyst. Summarize the behavior of the COBOL paragraph below.

Rules:
1. Cite the paragraph name and line range (e.g., CALC-DEDUCTIONS, lines 847-923) at the start.
2. List every copybook referenced and the specific fields used from each.
3. List every PERFORM or CALL target — do not summarize them; note them as out-of-scope callees.
4. Describe what the paragraph does in plain English: inputs, transformations, outputs.
5. If you cannot determine the behavior of any branch without more context, say so explicitly. Do not infer or assume — mark it as requiring additional context.
6. Do not summarize behavior from callees. Only this paragraph.

Paragraph:
[PARAGRAPH_NAME], lines [START]-[END]
[COBOL SOURCE]

Referenced copybook expansions:
[COPYBOOK_CONTENT]

Pattern 2: Business Rule Extraction with Citation Grounding
Every extracted rule gets a source address. Rules without addresses are not rules — they're guesses.
Business rule extraction is the pattern with the highest upside and the highest failure rate. The upside: a structured, queryable catalog of what the system actually does, written in plain language that business stakeholders can review. The failure rate when done naively: a catalog full of plausible-sounding rules that mix what the code does with what the analyst assumed it does.
The grounding discipline for rule extraction is non-negotiable: every extracted rule must include the source file name, the line range that implements it, and a verbatim snippet of the code that supports the claim. A rule that reads "Premium calculations apply a 12% loading for customers in the high-risk tier" is useless without PREMIUM-CALC.CBL:1203-1241 and the relevant MULTIPLY statement. With the citation, the business analyst can validate it in five minutes. Without it, the validation never happens, and the rule quietly corrupts the knowledge base.[10]
Run the extraction twice on the same source. If the two runs produce different rules from the same code, your grounding is broken — the model is generating rather than extracting. Treat extraction as a deterministic function: same input, same output. Variance is a signal of hallucination.
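That determinism check is cheap to automate once extraction output is parsed into structured records. A minimal sketch, assuming each extracted rule carries its citation as file and line-range fields (the field names here are illustrative, not a real tool's schema):

```python
def rule_key(rule):
    """Canonical identity of an extracted rule: its citation, not its
    prose. Two runs may phrase a rule differently, but they must point
    at the same code."""
    return (rule["file"], rule["start_line"], rule["end_line"])

def diff_runs(run_a, run_b):
    """Compare two extraction runs over the same source. Any non-empty
    result means the model generated rather than extracted -- hold the
    output out of the knowledge base until the variance is explained."""
    keys_a = {rule_key(r) for r in run_a}
    keys_b = {rule_key(r) for r in run_b}
    return {"only_in_a": keys_a - keys_b, "only_in_b": keys_b - keys_a}
```

Keying on the citation rather than the wording is deliberate: paraphrase variance is tolerable, citation variance is not.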
Ungrounded — reject:

- "The system applies a late-payment penalty of 1.5% after 30 days" — no source citation
- "High-risk customers are flagged during onboarding" — inferred from comment, not from code
- "The deduction logic handles federal and state taxes" — summary of comment block, actual logic not verified
- "Payroll runs on Friday evenings" — copied from a README that was last updated in 2009
- "The rate table is sourced from an external feed" — plausible inference, no CALL or READ statement cited

Grounded — accept:

- "Late-payment penalty of 1.5% applied when WS-DAYS-OVERDUE > 30" — BILLING-CALC.CBL:445-451, MULTIPLY WS-BALANCE BY 0.015 GIVING WS-PENALTY
- "High-risk flag set when CUST-RISK-SCORE > 750" — ONBOARD-VALIDATE.CBL:112-118, MOVE 'H' TO CUST-RISK-TIER
- "Federal tax calculated in CALC-FED-TAX (lines 847-923), state tax in CALC-STATE-TAX (lines 924-1005)" — separate paragraphs cited individually
- "Payroll batch scheduled via JCL job PRJOB042; scheduling config not in COBOL source — requires JCL analysis"
- "Rate table loaded via READ RATE-FILE at INIT-RATE-TABLE:33 — external file dependency confirmed in FD entry"
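The acceptance criterion above — citation plus verbatim quote — can be enforced mechanically before a rule enters the knowledge base. A sketch of that check, with an assumed rule record shape (field names are illustrative):

```python
def verify_citation(rule, sources):
    """Check that a rule's quoted snippet actually appears in the cited
    line range of the cited file.

    sources: {filename: list of source lines}. Fails closed: a rule
    whose citation cannot be verified is rejected, not waved through.
    """
    lines = sources.get(rule["file"])
    if lines is None:
        return False
    span = "\n".join(lines[rule["start_line"] - 1:rule["end_line"]])
    return rule["quote"].strip() in span
```

This doesn't prove the rule's plain-language claim is right — that's the human validator's job — but it guarantees the validator is looking at real code at a real address.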
Pattern 3: Data Lineage Mapping Across Files and Programs
Which field feeds which, across copybooks, CALL chains, and JCL job steps
Data lineage mapping is the most technically demanding extraction pattern, and the one with the clearest downstream value. Most legacy systems don't process data in a single program — they pass it through a chain of COBOL programs, RPG procedures, JCL job steps, and shared copybooks. A field named CUST-BALANCE might be set in ACCT-LOAD.CBL, transformed in BALANCE-ADJ.CBL, fed through a copybook CUSTMAST.CPY used by twelve programs, and ultimately written to a report by RPT-STMT.CBL. No single developer holds that chain in their head.
A data lineage agent builds this graph systematically. It starts from a target field — say, the field that feeds the dollar amount on the customer statement — and traces backwards: what assigns this field? What reads the assigning program's output? What copybook defines the shared structure? The agent emits a graph where every edge cites the program and line number where the data moves. The result isn't a summary — it's a map you can query. "Show me everything that writes to CUST-BALANCE before the statement run" becomes answerable without a human tracing CALL statements by hand.[12]
The failure mode to guard against: copybook expansion. COBOL programs use copybooks as shared struct definitions, and the same field name can appear in multiple copybooks with different storage layouts. A lineage agent that doesn't resolve copybook expansions before tracing field references will produce a broken graph — edges that claim a field is shared between programs when they're actually independent fields with the same name. Resolve all copybooks before any lineage analysis begins. This is not optional.
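The edge-emission step can be illustrated with a toy tracer over copybook-expanded source. This sketch handles only plain MOVE statements; a real lineage agent also needs COMPUTE, ADD/SUBTRACT ... GIVING, CALL ... USING, and file I/O, and the input must already be copybook-expanded for the reasons just described:

```python
import re

# Simplified: matches "MOVE <field> TO <field>" only; literals,
# qualified names (OF/IN), and multi-target MOVEs are out of scope.
MOVE = re.compile(r"\bMOVE\s+([A-Z0-9-]+)\s+TO\s+([A-Z0-9-]+)", re.I)

def lineage_edges(program_name, expanded_lines):
    """Emit one cited edge per MOVE in copybook-expanded source.
    Every edge records the program and line number, so the resulting
    graph stays auditable down to the statement."""
    edges = []
    for lineno, line in enumerate(expanded_lines, start=1):
        for src, dst in MOVE.findall(line):
            edges.append({"from": src, "to": dst,
                          "program": program_name, "line": lineno})
    return edges
```

Queries like "everything that writes to CUST-BALANCE" then become simple filters over the edge list.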
Pattern 4: Generate Tests BEFORE You Rewrite
The most important pattern of the five. Most rewrites skip it and discover the requirements during incidents.
If there is one pattern that separates modernizations that succeed from ones that quietly fail for six months after go-live, it's this: generate a behavioral test suite from the legacy system before you write a single line of the new system.
The test suite is not documentation. It's a regression harness derived from observing what the legacy system actually does with real inputs. You run the extraction agent against the codebase to identify the significant behavioral branches. You instrument the legacy system to capture input/output pairs for each branch — ideally from production traffic, or from a test data set built to cover the branches the agent identified. You generate test cases that assert the observed behavior. Then you run those test cases against the legacy system to verify they're correct. The suite is green against the thing you're replacing.
Now you have a definition of correct behavior that doesn't depend on anyone's memory or any documentation that may be wrong. When the new system passes the suite, you have evidence — not faith — that it behaves equivalently. When it fails, you have a precise description of where it diverges. The business analyst reviewing the divergence is looking at a specific input and a specific output difference, not a vague "something seems off" from a user in UAT.[9]
The teams that skip this step discover their requirements during production incidents. The deduction branch that only fires for one customer processes $4M a year. It wasn't in any spec. It wasn't in any test the new system had. It shows up three months after cutover in a payroll discrepancy.
1. Identify behavioral branches with the extraction agent. Run function-level summarization across the most critical programs to build a branch inventory. Each conditional in the summarization output is a potential test case.
2. Build a test data set that covers the branches. Synthetic test data misses edge cases that only exist in production. Use anonymized production traces where possible.
3. Instrument the legacy system to capture output for each test input. Run the test data set through the legacy system and record the outputs. These become your assertions.
4. Generate test cases from the corpus and validate them against the legacy system. Write tests that assert the captured outputs for each input. Run them green against the legacy system before using them as a regression suite.
5. Run the suite against the new system during development. Every divergence from the legacy output is a decision point: is this divergence a bug in the new system, or is it an intentional improvement? Document the decision.
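The resulting harness is a golden-master suite: captured input/output pairs from the live legacy system become the assertions the new system must satisfy. A minimal sketch of that shape (the runner and its field names are illustrative):

```python
def make_regression_suite(cases):
    """cases: (input, expected_output) pairs captured from the live
    legacy system. Returns a runner that reports every divergence with
    enough detail for a reviewer to adjudicate each one: bug in the new
    system, or intentional improvement?"""
    def run_suite(system_under_test):
        failures = []
        for inp, expected in cases:
            actual = system_under_test(inp)
            if actual != expected:
                failures.append({"input": inp,
                                 "expected": expected,
                                 "actual": actual})
        return failures
    return run_suite
```

Run the suite against the legacy system first: if it isn't green there, the captures are wrong and nothing downstream can be trusted.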
Pattern 5: Interface Synthesis for the Strangler Fig
Generate the modern API contract from the legacy program's entry points before building the replacement
The strangler fig migration pattern — routing new traffic to a modern service while the legacy system handles the rest — requires a clean interface definition between the two. That interface can't be guessed or invented. It has to match what the legacy program actually accepts and returns, or the routing layer becomes a translation layer that accumulates its own bugs.
Interface synthesis extracts the API contract from the legacy program's entry points: the LINKAGE SECTION in COBOL, the parameter lists in RPG procedures, the method signatures in legacy Java. The agent maps each parameter to its copybook definition or class field, preserves the original data types and lengths, and generates an OpenAPI specification or equivalent that can drive the modern service's contract. Every field in the spec traces back to a named field in the legacy source — no invented field names, no guessed types. The strangler fig router is built against this spec, not against an assumption.[6]
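The type-mapping step at the heart of this pattern can be sketched for the common PIC clauses. This is a simplified illustration, not a complete COBOL type system — signed fields (S9), COMP usages, and edited pictures are omitted:

```python
import re

def pic_to_openapi(pic):
    """Map a COBOL PIC clause to an OpenAPI schema fragment, preserving
    the original length so the contract matches the LINKAGE SECTION.
    Covers only the common alphanumeric and numeric forms."""
    m = re.fullmatch(r"X\((\d+)\)", pic)
    if m:  # alphanumeric, fixed maximum length
        return {"type": "string", "maxLength": int(m.group(1))}
    m = re.fullmatch(r"9\((\d+)\)(?:V9\((\d+)\))?", pic)
    if m:
        if m.group(2):  # V marks an implied decimal point
            return {"type": "number",
                    "description": (f"{m.group(1)} digits, "
                                    f"{m.group(2)} decimal places")}
        return {"type": "integer", "maximum": 10 ** int(m.group(1)) - 1}
    raise ValueError(f"unhandled PIC clause: {pic}")
```

Raising on anything unrecognized is the grounding discipline again: an interface field the mapper can't explain gets escalated to a human, not silently typed as a string.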
This is the shortest path from legacy comprehension to active migration. Once you have a grounded interface spec, you can build the new service and route a subset of traffic to it while the legacy system handles the rest — the coexistence playbook in practice.
The Tool Landscape (Honestly Assessed)
Tools differ in what they ground against and which language stacks they understand. Pick based on your stack, not the vendor pitch.
The tool market for legacy comprehension has expanded significantly in the past eighteen months. Every major cloud vendor has a product, IBM has a mainframe-specific offering, and a new category of code intelligence platforms has appeared. The honest assessment: most of these tools help with the comprehension problem. None of them solve it without a grounding discipline imposed by your team. The tool is a retrieval and generation layer. The grounding is your responsibility.
Tools worth evaluating in 2026
- Anthropic Claude Code — General-purpose, strong with structured prompts and grounded retrieval. No built-in COBOL indexer, so you bring your own; the five extraction patterns above apply directly. Best for teams comfortable building their own comprehension pipeline.
- IBM Watsonx Code Assistant for Z — Mainframe-specific, trained on COBOL and the Z ecosystem. NOSI case study showed 79% reduction in time to understand complex COBOL applications.[7] Expensive and IBM-flavored: the value is real, the lock-in is real. Justified for large IBM Z shops.
- Amazon Q Code Transformation — Java-focused; strong for upgrading legacy Java 8/11 to Java 17/21, less relevant for COBOL or RPG.[8] The right tool for modernizing old Java applications, the wrong tool for mainframe comprehension.
- GitHub Copilot — Excellent for modern languages with good grounding via workspace indexing. Understands COBOL syntax but lacks the domain-specific training of Watsonx for mainframe semantics. Better as a supplement than a primary tool for deep COBOL work.[12]
- Cursor — Strong IDE-based comprehension with codebase indexing. Works well for legacy Java where the IDE can parse the structure. COBOL support requires manual configuration. Good for interactive, file-by-file exploration.
- Code Comprehend — A newer entrant (beta opened December 2025) purpose-built for legacy comprehension across COBOL, Java, and .NET. Architecture visualization and dependency mapping built in. Worth evaluating for teams without the budget for Watsonx.
Anti-Patterns That Burn Modernization Budgets
Each of these shows up in every failed legacy AI engagement. Recognize them before they cost six months.
The Whole-Codebase Dump
Feeding 40,000 lines of COBOL into a single context window and asking 'what does this do?' produces confident summaries with embedded hallucinations. The model attends unevenly across large contexts. Function-level extraction with retrieval is not a workaround — it's the only approach that produces verifiable output.
The Confident Translator
Asking the agent to translate COBOL to Java without first extracting the business rules produces a Java program that implements what the LLM inferred the COBOL was doing. That inference is often close to correct. It is not correct in the specific cases where correct matters most.
The Trust Without Verify
Running extraction and treating the output as ground truth without human review is the fastest way to corrupt a knowledge base. Every extracted rule needs a subject-matter validator — ideally a domain expert, minimally someone who can verify the citation against the source code. The extraction agent produces candidates, not facts.
The Skip the Test Data Set
Generating behavioral tests with synthetic data misses the edge cases that production has discovered over thirty years. If you can't access anonymized production traces, the branch coverage of your synthetic data needs to be explicitly verified against the branch inventory from the extraction agent.
The Rewrite Before Read
Starting the rewrite before the comprehension program is complete forces developers to discover requirements during implementation — the most expensive place to discover them. A six-week comprehension program that delays a six-month rewrite by six weeks is almost always the right trade.
What This Looks Like in Your First 90 Days
A practical sequence for standing up a legacy comprehension program from scratch
1. Scope and inventory the target system (Days 1–14). Before any agent touches the codebase, establish what you're working with. This is archaeology before excavation.
2. Stand up the retrieval and indexing layer (Days 15–30). The comprehension agent is only as good as its retrieval layer. Invest here before writing a single extraction prompt.
3. Run extraction on the critical programs and validate (Days 31–60). Extract function summaries and business rules for the priority programs, with human validation at each step.
4. Generate and validate the behavioral test suite (Days 61–90). Use the validated knowledge base to drive test case generation. End the 90 days with a green regression suite against the legacy system.
Common Questions
We don't have any COBOL developers left — can we even run this?
Yes — this is specifically the problem the comprehension agent addresses. You don't need a COBOL developer to run the extraction pipeline; you need someone who can validate business rules against the source citations. That can be a domain expert who understands the business logic (a senior actuary, a payroll specialist, a compliance officer) who can look at a cited rule and say whether it describes correct behavior. The extraction agent handles the COBOL reading. Human validation handles the business correctness check.
What about RPG and CL on AS/400?
The five patterns apply directly to RPG and CL. IBM's Watsonx Code Assistant for Z, as of the 'Project Bob' consolidation announced in late 2025, covers both IBM Z COBOL and IBM i RPG in a unified product. For RPG specifically, the copybook equivalent is the data structure definition (/COPY members), and the CALL chain equivalent is the procedure call tree — both resolve the same way as COBOL in the extraction pipeline. The grounding discipline is identical.
Is IBM Watsonx Code Assistant for Z worth the price?
For large IBM Z shops running significant COBOL volume, probably yes — the NOSI case study showed 79% reduction in comprehension time, and the product has mainframe-specific training that general-purpose models lack.[7] For smaller shops or organizations where the COBOL is a contained subsystem rather than the core platform, the licensing cost may exceed the benefit. The honest comparison is: what's the cost of a 6-month consulting engagement versus a year of Watsonx licensing plus the productivity gain? For most large banks and government agencies running mainframes as primary infrastructure, the math favors Watsonx. For an insurance company with 200K lines of COBOL for one legacy module, it probably doesn't.
How do we handle copybooks and macros?
Resolve them before any extraction begins — this is non-negotiable. A COBOL program that references a copybook is incomplete without it. The extraction agent needs the expanded form to trace field references and data movements correctly. Build a pre-processing step in your indexer that resolves all COPY statements and inline-expands the copybook content. Tag expanded sections with their source copybook name so citations remain traceable. The same principle applies to JCL symbolic substitution and RPG /COPY directives.
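That pre-processing step can be sketched as a simple inline expander that tags each expanded region with its source copybook. This is a non-recursive illustration: real copybooks can nest and take REPLACING clauses, both of which a production indexer must handle.

```python
import re

# Matches a standalone "COPY <name>." statement (simplified: ignores
# REPLACING clauses and SUPPRESS, and assumes one statement per line).
COPY_STMT = re.compile(r"^\s*COPY\s+([A-Z0-9-]+)\s*\.\s*$", re.I)

def expand_copybooks(program_lines, copybooks):
    """Inline every COPY statement, bracketing the expansion with
    comment markers naming the source copybook so that downstream
    citations remain traceable to the original member.

    copybooks: {name: list of copybook source lines}.
    """
    out = []
    for line in program_lines:
        m = COPY_STMT.match(line)
        if m:
            name = m.group(1).upper()
            out.append(f"      *> begin COPY {name}")
            out.extend(copybooks[name])
            out.append(f"      *> end COPY {name}")
        else:
            out.append(line)
    return out
```

The bracketing markers are the important design choice: an extraction agent citing a line inside the expansion can report both the program location and the copybook it came from.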
When is the comprehension good enough to start the rewrite?
When the behavioral test suite is green against the legacy system and covers the critical branches identified in the business rule catalog. That's the objective criterion. The subjective criterion — 'we feel like we understand it well enough' — is how teams end up discovering requirements in production. The test suite gives you a falsifiable definition of 'good enough.' If the new system passes the suite and all divergences are reviewed and approved, you have evidence the comprehension was sufficient. If you start the rewrite before the suite is green, you're betting on your own comprehension, which is the exact problem you were trying to solve.
Legacy Comprehension Program Checklist
- Production source confirmed — not an archived version that diverged from production
- All copybooks / /COPY members / data structure definitions inventoried and indexed
- Call graph mapped — entry points and utility programs identified
- Five to ten most business-critical programs prioritized with stakeholder input
- Retrieval indexer validated: querying a known field name returns the correct paragraph
- Extraction run twice on the same source — outputs match (determinism confirmed)
- Every extracted business rule has a source file:line citation
- Domain expert has validated each rule in the knowledge base against its citation
- Data lineage map built for fields feeding critical financial calculations or state transitions
- Behavioral test cases generated from business rule catalog and branch inventory
- Test suite green against the live legacy system before any rewrite work begins
- Interface specification for strangler fig migration derived from legacy entry points (not documentation)
The reason most modernizations fail isn't the technology. It's that the system was rewritten before it was understood. Confidence filled the comprehension gap — developer confidence, consultant confidence, project manager confidence that the legacy system was simpler than it looked. It was never simpler than it looked. Thirty years of business logic, edge cases, and undocumented behavior accumulated in that COBOL because the business kept changing and the system kept being patched to match. Read first. Map the rules. Generate the tests. Then decide whether to rewrite.
The extraction patterns in this piece are not new research — they're disciplines that careful teams have applied to legacy systems for years, now executable at speed with AI comprehension agents. Gartner's projection that 40% of legacy modernization projects will incorporate AI-assisted reverse engineering by 2026 — up from under 10% in 2023 — reflects a genuine shift in what's practical.[10] The tools are good enough. The constraint is the discipline to use them correctly: ground every claim, validate every rule, and don't mistake a fast extraction for a correct one.
- [1] The Stack: There's Over 800 Billion Lines of COBOL in Daily Use (thestack.technology)
- [2] IBM: What Is COBOL Modernization? — 240 Billion Lines Active, 5 Billion Written Annually (ibm.com)
- [3] Precisely: 9 Mainframe Statistics — 71% of Fortune 500 Companies Use Mainframes (precisely.com)
- [4] MetaIntro: The $3 Trillion Code Nobody Knows How to Fix — COBOL Developer Shortage 2026 (metaintro.com)
- [5] Integrative Systems: Why COBOL Programmers Are Still in Demand in 2026 — Average Developer Age 55 (integrativesystems.com)
- [6] IBM: watsonx Code Assistant for Z — Mainframe COBOL Modernization (ibm.com)
- [7] IBM Research: Watsonx Code Assistant for Z — 79% Reduction in Time to Understand COBOL Applications (NOSI Case Study) (research.ibm.com)
- [8] AWS: Amazon Q Developer Transform — Java Legacy Modernization (aws.amazon.com)
- [9] HackerNoon: AI Agents vs. COBOL — How Legacy Mainframes Are Being Reverse-Engineered at Scale (hackernoon.com)
- [10] SoftwareSeni: Cutting Legacy Reverse Engineering Time by 66% with AI Code Comprehension (softwareseni.com)
- [11] EPAM: Mainframe Modernization ROI — A Cost-Focused Guide, Average Annualized Savings $23.3M (epam.com)
- [12] GitHub Blog: How GitHub Copilot and AI Agents Are Saving Legacy Systems (github.blog)
- [13] Fujitsu: Generative AI Service That Analyzes COBOL Source Code and Automatically Generates Design Documents (March 2026) (global.fujitsu)