Per Precisely, roughly 72% of global transactional workloads still execute on the mainframe.[1]
Hundreds of billions of operations per year across banking, insurance, and retail — the data plane your RAG system needs to read[2]
The most common extraction pipeline. It is also the reason RAG built on top of it answers yesterday's question with today's confidence.
50%
Industry estimate. Higher wherever EBCDIC conversion, copybook parsing, and encoding fixes are still manual — which is most places.
Every mainframe-to-RAG project starts from the same architecture diagram. A box labeled "enterprise data" with an arrow pointing at a box labeled "vector database." Nobody asked what was inside the first box. The AI strategy assumed Postgres or Snowflake. The reality is DB2 on z/OS, IMS hierarchical segments that require COBOL procedural navigation to read, and VSAM files in EBCDIC — a character encoding that predates ASCII's widespread adoption and that has burned every team that assumed a file export would just work.[1]
The bridge between those two realities is the hard problem. Not the embedding model. Not the chunking strategy. The upstream pipeline, its freshness characteristics, and whether the records it hands your vector store are recent enough to be trusted. Everything downstream is decoration on a bad foundation.
This catalog covers the seven patterns operators actually run. Each trades freshness, blast radius, and MIPS cost differently. Most enterprises end up running two or three at once across different data domains. Not from indecision. From the fact that a single pattern applied to a heterogeneous mainframe estate optimizes for the wrong domain every time.
The LLM Was Never the Bottleneck. The Bytes Are.
Three structural forces collapse mainframe-to-RAG projects that nailed everything north of the data layer.
The difficulty is structural. Three forces, and none of them yield to a better model.
The formats are foreign. EBCDIC is not ASCII with a swapped lookup table. It carries different control characters, a different sort order, and special-character handling that quietly corrupts data the moment a team treats the conversion as a one-byte-for-one-byte swap. Packed decimal (COMP-3) stores two digits per byte, with the sign in the low nibble of the last byte. Zoned decimal stores one digit per byte, with the digit in the low nibble and a zone in the high nibble; the sign is typically overpunched into the zone nibble of the last byte. Convert EBCDIC to UTF-8 before unpacking the numerics and the numbers are wrong. Unpack with the wrong copybook and the numbers are plausible nonsense.[8] Neither failure announces itself until a retrieval result is off in a way nobody can attribute back to the byte stream.
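The two numeric layouts described above can be sketched in a few lines. This is a minimal illustration, not a production parser: the function names `unpack_comp3` and `unpack_zoned` are made up for this example, the sign is assumed to be trailing, and `scale` stands in for the implied decimal position a copybook would supply.

```python
from decimal import Decimal

NEGATIVE_SIGNS = (0x0B, 0x0D)  # 0x0D is the preferred negative sign code


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode packed decimal: two digits per byte, sign in the final low nibble."""
    if not raw:
        raise ValueError("empty COMP-3 field")
    nibbles = [n for byte in raw for n in (byte >> 4, byte & 0x0F)]
    sign = nibbles.pop()
    if sign < 0x0A or any(d > 9 for d in nibbles):
        raise ValueError(f"not a valid COMP-3 field: {raw.hex()}")
    value = int("".join(str(d) for d in nibbles))
    if sign in NEGATIVE_SIGNS:
        value = -value
    return Decimal(value).scaleb(-scale)


def unpack_zoned(raw: bytes, scale: int = 0) -> Decimal:
    """Decode zoned decimal: one digit per byte, sign in the last byte's zone nibble."""
    digits = [byte & 0x0F for byte in raw]
    if not raw or any(d > 9 for d in digits):
        raise ValueError(f"not a valid zoned-decimal field: {raw.hex()}")
    value = int("".join(str(d) for d in digits))
    if (raw[-1] >> 4) in NEGATIVE_SIGNS:
        value = -value
    return Decimal(value).scaleb(-scale)


packed = bytes.fromhex("12345C")        # +123.45 with two implied decimals
balance = unpack_comp3(packed, scale=2)
as_text = packed.decode("cp037")        # decodes without error, but the digits are gone
```

The last line is the failure mode in miniature: decoding packed bytes as text raises nothing, so the corruption only shows up downstream.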
The latency gap is enormous and it lands inside the answer. A nightly extract is T+24h stale by the time embeddings reach the vector store. Account balance, open claim, current inventory — for any domain that moves intraday, T+24h is not a delay. It is a confidently wrong answer that erodes trust faster than having no RAG system at all. Freshness is not a technology preference. It is a property of the data domain.
The people who knew the schema have retired. COBOL copybooks — the byte-level layouts describing what each field of a VSAM record means — are routinely undocumented, inconsistently maintained, or missing entirely for files modified across decades. The data engineer assigned to build the pipeline inherits a system where the only authoritative schema definition is a COBOL program written in 1992, last touched by someone who left in 2009.[7] That is not a documentation gap. That is the default state.
Seven Patterns. Pick by Data Domain, Not by Team Preference.
Each one trades freshness, blast radius, and MIPS cost differently. None is universally correct.
The taxonomy below runs from a nightly batch export to a real-time event stream. None of them is universally correct. The only question that matters: what freshness does this data domain require, and what does meeting that freshness cost in MIPS, latency, and on-call load?
Most enterprises run two or three of these in parallel. Reference data — product codes, regulatory tables, org hierarchies — tolerates nightly refresh and belongs in Pattern 1. Customer account state feeding fraud detection cannot live there. The architecture should encode that asymmetry, not paper over it.
One scar from running these projects: teams that wire up CDC before mapping freshness requirements always regret it. A pilot we sat through configured CDC for 40 DB2 tables before the business owners had reviewed the staleness windows. Thirty-five of the 40 turned out to be reference data with a T+24h budget. That was three months of IIDR licensing and MIPS overhead spent on a nightly-refresh problem. Map the freshness requirements first. Pattern selection is downstream of that map.
| Pattern | Freshness | Operational Load | Cost | Where it earns its keep |
|---|---|---|---|---|
| 1. Nightly Export | T+24h | Low | Low | Reference data, regulatory snapshots, slow-moving master data |
| 2. CDC from DB2, IMS, and VSAM | T+1–5 min | High | Medium–High | Transactional DB2 tables, customer records, live account balances |
| 3. Federated Query | Real-time query | Medium | High (MIPS) | Ad-hoc reads where copy latency is unacceptable and volume is bounded |
| 4. Event Sourcing | Sub-minute | High | Medium | Application-layer semantics from CICS and IMS transaction paths |
| 5. Dual-Write | Real-time on the write path | Very High | High | Active migration where old and new must stay consistent during the move |
| 6. Schema-on-Read | T+24h (typically) | Medium | Low | Legacy files where the copybook is missing, drifted, or hostile |
| 7. RAG Over the COBOL Itself | As-of last commit | Low | Low | Schema discovery, data-catalog Q&A, recovering layouts from the code |
Pattern 1: Nightly Export — Honest, Cheap, and Wrong for Anything That Moves
Every mainframe team has this pipeline. A JCL job fires at 02:00, exports DB2 tables or VSAM files to flat files, those land via FTP or MFT, an ETL job picks them up, converts EBCDIC to UTF-8, unpacks the COMP-3 fields, and loads the result into a warehouse or object store. It is honest, cheap, and well-understood.
Where it earns its keep: reference data that moves slowly — product catalogs, regulatory code tables, org hierarchies, postal mappings. Compliance snapshots. Anything where T+24h is acceptable and the operational tax of a real-time pipeline is not justified by the question being asked.
The failure mode is reuse. Account balances, open claims, current inventory, in-flight orders — they change during the business day. A RAG system answering questions about them off last night's snapshot produces confidently wrong answers, with detail. Surface the export timestamp in every retrieval result. If you do not, users will trust the wrong answer longer than they should — and that trust does not come back.
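One way to surface the export timestamp is to carry it on every chunk and render it next to the retrieved text. A minimal sketch follows; the `RetrievedChunk` shape, the `annotate_freshness` helper, and the dataset name in the usage are all hypothetical, and the freshness budget is assumed to come from the domain's owner.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetrievedChunk:
    text: str
    source: str            # dataset or table the chunk came from
    exported_at: datetime  # timezone-aware time of the batch export


def annotate_freshness(chunk: RetrievedChunk, budget: timedelta,
                       now: Optional[datetime] = None) -> str:
    """Append the export time, age, and a staleness flag to the chunk text."""
    now = now or datetime.now(timezone.utc)
    age = now - chunk.exported_at
    hours = age.total_seconds() / 3600
    flag = "STALE" if age > budget else "within budget"
    return (f"{chunk.text}\n[source: {chunk.source}; "
            f"exported {chunk.exported_at.isoformat()}; age {hours:.1f}h; {flag}]")


chunk = RetrievedChunk("Balance: 120.50", "CUST.MASTER",
                       datetime(2025, 1, 1, 2, 0, tzinfo=timezone.utc))
print(annotate_freshness(chunk, timedelta(hours=24),
                         now=datetime(2025, 1, 2, 8, 0, tzinfo=timezone.utc)))
```

Passing the annotated string into the prompt, rather than only logging it, is what lets the model and the user both see how old the fact is.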
The EBCDIC step gets less attention than it deserves. Cobrix — a COBOL/EBCDIC parser for Spark — handles copybook-driven extraction reasonably well for VSAM files with known layouts. DB2 exports via DSNUTILB UNLOAD are more tractable, but packed decimal still requires explicit handling. Verify numeric field ranges against known-good control records before trusting the pipeline at production volume. Once corrupted embeddings are in the vector store, you do not patch them — you reindex.[8]
Pattern 2: CDC from DB2, IMS, and VSAM — Real-Time Costs Real MIPS
CDC is the production-grade option when the domain demands near-real-time freshness. Instead of taking snapshots, CDC reads the database transaction log (the DB2 active and archive logs, which the BSDS inventories; the IMS OLDS; a VSAM journal) and emits every committed change as a structured event. Inserts, updates, and deletes flow downstream with 1–5 minute latency under normal load.
The tools that ship. IBM InfoSphere Data Replication (IIDR), since rebranded IBM Data Replication, is the incumbent for DB2 z/OS CDC, with native IMS and VSAM source support.[3] Qlik Replicate advertises a zero-footprint architecture that avoids installing agent software on the mainframe itself, which matters enormously to mainframe operations teams who treat the MIPS budget as a controlled substance.[4] Precisely Connect is the third major option, historically stronger for VSAM and sequential file CDC where IIDR has been weaker.
What CDC captures, and where it goes silent. For DB2 z/OS, archive-log CDC is well-understood and reliable for row-level changes. For IMS, CDC captures segment inserts, updates, and deletes — but reconstructing the hierarchical relationships between parent and child segments requires logic the CDC tool has to get right, and not all of them do for complex PCB structures. For VSAM, journal-based CDC depends on the file being opened for output with journaling enabled, which is not the default. Many VSAM files are written by batch jobs that open them in DISP=OLD without journaling. Those writes are invisible to the CDC agent. Invisible failure modes are the worst kind.
MIPS overhead is real, and the vendor will tell you it is not. Every CDC tool claims minimal mainframe impact. Every mainframe operations team disagrees. Budget for 3–8% MIPS overhead and negotiate it with capacity planning before launch, not after the first invoice cycle.
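On the consuming side, a CDC stream only helps if replays and out-of-order deliveries cannot corrupt the replica. The sketch below shows one idempotent way to apply change events. The `ChangeEvent` shape is an assumption for illustration and does not match any vendor's actual event format; the log sequence number is assumed to increase monotonically per key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                       # "insert", "update", or "delete"
    key: str                      # primary key of the changed row
    lsn: int                      # log sequence number from the source log
    after: Optional[dict] = None  # row image after the change (None for deletes)


class ReplicaIndex:
    """In-memory stand-in for the store that feeds the embedding job."""

    def __init__(self) -> None:
        self.rows: dict = {}
        self.applied_lsn: dict = {}
        self.pending_reembed: set = set()

    def apply(self, event: ChangeEvent) -> bool:
        """Apply one event; return False if it is a replay or arrived late."""
        if event.lsn <= self.applied_lsn.get(event.key, -1):
            return False
        if event.op == "delete":
            self.rows.pop(event.key, None)
            self.pending_reembed.discard(event.key)
        elif event.op in ("insert", "update"):
            if event.after is None:
                raise ValueError(f"{event.op} event without a row image")
            self.rows[event.key] = event.after
            self.pending_reembed.add(event.key)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        self.applied_lsn[event.key] = event.lsn
        return True
```

Tracking `pending_reembed` separately keeps the embedding job decoupled from the change stream, so a burst of updates to one row produces one re-embed rather than many.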
Pattern 3: Federated Query — Compute Cost Does Not Disappear, It Moves
Federated query virtualization — Denodo, Trino, Starburst — offers a clean promise. Expose mainframe data as SQL without copying it anywhere. The RAG retriever issues a query. The federation layer pushes it to the DB2 subsystem. The result comes back. No replication lag, no schema sync, no embedding rebuild.
Where it earns its keep. Ad-hoc reads on reference data with no sub-second latency requirement. Data-catalog flows where a human analyst spot-checks mainframe records. Reporting paths where query frequency is low and operations has pre-negotiated query windows with the DBAs.
The trap is throughput. Federation does not eliminate compute cost. It moves it onto the mainframe. Every federated query runs on z/OS and consumes MIPS. At RAG retrieval volume, this is not a line item; it is a budget event. A retriever issuing 500 federated queries per minute lights up the MIPS invoice, and your mainframe operations team will escalate to your VP before you have a chance to explain the architecture. The blast radius of a hot retrieval path on a federation layer is exactly as large as the MIPS pool you share with the rest of the bank.
The honest framing: federation is not a replacement for replication. It is a complement for low-frequency authoritative reads where the latency of a live mainframe query (typically 50–200ms for a simple indexed lookup) fits inside the retrieval SLA. For high-frequency retrieval paths, replicate the hot data into a modern store and reserve federation for the records that demand a live read or sit behind access controls you cannot mirror.
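One way to keep federation in its lane is to put a hard per-minute budget in front of it and serve the replicated copy once the budget is spent. The sketch below uses a plain token bucket. `FederationBudget`, `read_record`, and the `live_query` callable are hypothetical names, and the right per-minute figure is whatever capacity planning has agreed to.

```python
import time
from typing import Callable


class FederationBudget:
    """Token bucket capping live mainframe queries per minute."""

    def __init__(self, per_minute: int, clock: Callable[[], float] = time.monotonic):
        self.capacity = float(per_minute)
        self.tokens = float(per_minute)
        self.rate = per_minute / 60.0  # tokens refilled per second
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def read_record(key: str, budget: FederationBudget,
                live_query: Callable[[str], dict], replica: dict):
    """Use the live read while budget remains; otherwise serve the replica."""
    if budget.try_acquire():
        return live_query(key), "live"
    return replica[key], "replica"
```

Returning the origin label alongside the record lets the retrieval layer tell the user whether the answer came from a live read or from a copy.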
Pattern 4: Event Sourcing — Application Semantics CDC Cannot See
CDC captures changes at the database layer. Event sourcing captures them at the application layer: CICS transactions, IMS applications, and z/OS Connect services emitting business events to a message bus. IBM Event Streams runs Apache Kafka on Linux on IBM Z, letting you co-locate the cluster with the mainframe and keep delivery from application transaction to downstream consumer under a minute.[5]
Why events beat CDC where the question is semantic. A CDC event tells you which DB2 columns changed and to what value. An application-emitted business event tells you why — claim denied because of fraud review, payment reversed on an NSF condition, account flagged by a risk rule. That context does not exist in the database log. For RAG systems answering business questions, the richer event is the difference between an accurate answer and a technically correct one.
The trap is the batch window. Enterprise mainframe estates run overnight batches where COBOL jobs make large-scale updates to DB2 tables and VSAM files through direct file access, bypassing the online transaction layer entirely. Application event sourcing sees none of this. An event-sourced pipeline that covers 100% of CICS transactions may still miss 40% of data changes because the batch window is where the bulk happens. Event sourcing without a CDC fallback for batch coverage is not a complete architecture. It is a half-architecture that produces convincing answers about the half of the world it can see.
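The batch-window gap can be measured rather than guessed. A rough sketch, assuming you can obtain the set of keys that changed in the database log and the set of keys that appeared in application events over the same window; `event_coverage` is a hypothetical helper.

```python
from typing import Iterable, Set, Tuple


def event_coverage(log_changed_keys: Iterable[str],
                   event_keys: Iterable[str]) -> Tuple[float, Set[str]]:
    """Return the share of log-level changes that also produced an event,
    plus the keys that changed without any event (typically batch writes)."""
    changed = set(log_changed_keys)
    seen = set(event_keys)
    missed = changed - seen
    if not changed:
        return 1.0, missed
    return (len(changed) - len(missed)) / len(changed), missed


coverage, missed = event_coverage(["A", "B", "C", "D", "E"], ["A", "B", "C"])
```

Run over a window that includes the overnight batch, a coverage figure well below 1.0 is the signal that event sourcing alone is not seeing the whole estate.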
Pattern 5: Dual-Write — A Migration Pattern, Not a Steady State
Dual-write belongs to migrations, not steady state. During cutover — while the application is being moved from mainframe storage to a modern store — every write fires to both systems. The RAG pipeline reads from the new system, which is being continuously populated. The mainframe stays authoritative until it does not.
The write-path interception itself is mechanical: dual-issue inside a distributed transaction or with compensating rollback, log divergence on every commit. The reconciliation job is where dual-write actually lives or dies. Every dual-write architecture requires a parallel verification process that continuously compares records across systems and surfaces discrepancies. That job is typically 3–5x the implementation effort of the dual-write itself, and it runs for the entire cutover — which is always longer than anyone planned for.
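The core of a reconciliation job is small even if operating it is not. The sketch below compares two keyed record sets by content digest. It assumes both sides have already been normalized to the same field names and value formats, which is usually where the real effort goes; `record_digest` and `reconcile` are illustrative names.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Stable digest of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {key: record} maps and report where they diverge."""
    missing = sorted(source.keys() - target.keys())  # in source, not in target
    extra = sorted(target.keys() - source.keys())    # in target, not in source
    mismatched = sorted(
        key for key in source.keys() & target.keys()
        if record_digest(source[key]) != record_digest(target[key])
    )
    total = len(source.keys() | target.keys())
    divergent = len(missing) + len(extra) + len(mismatched)
    return {
        "missing": missing,
        "extra": extra,
        "mismatched": mismatched,
        "divergence": divergent / total if total else 0.0,
    }
```

At production volume the comparison would run over partitions or sampled key ranges rather than whole tables held in memory, but the three result buckets stay the same.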
Dual-write fits when the migration is bidirectional and legacy consumers still need to read from the mainframe while the AI pipeline reads from the modern store. It is not a long-term steady state. The reconciliation overhead, the dual MIPS consumption, and the operational complexity of keeping two systems consistent are costs that compound daily — and the second-order cost of forgetting to retire dual-write after cutover is permanent.
Pattern 6: Schema-on-Read — When the Copybook Is Gone or Lying
Sometimes the copybook is missing. Sometimes it exists but was last updated in 2003 and no longer matches the file. Sometimes the file has been modified by seven COBOL programs across fifteen years and nobody can tell you what byte 47 currently means. The schema is not gone — it has drifted past anyone's ability to vouch for it.
The only defensible move is to export the raw bytes and apply schema-on-read at query time. Heuristic field detection, LLM-assisted structure inference, human-verified sample records — assembled into a probabilistic schema that is good enough for embedding and explicit about its uncertainty.
This is risky, and sometimes it is the only path forward. The mechanism: export a 10,000-record sample, run it through a COBOL layout inference tool (or prompt an LLM with byte-frequency distributions and known field constraints), generate a candidate copybook, validate it against a known-good control set, and carry a schema-confidence metadata field on every embedded record. Retrieval results derived from schema-on-read data must surface their confidence to the user. Hide it and you have built a confident liar.
Never use schema-on-read for financial fields without human verification of every inferred numeric conversion. A packed decimal misread as a character string in a customer balance record is not recoverable by patching the schema later — the wrong embeddings are already in the vector store, and corrupt embeddings do not get corrected. They get reindexed.
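The validation and confidence-tagging steps can be sketched as follows. This assumes the candidate layout has already been turned into a `parse` callable that raises `ValueError` on bytes it cannot interpret, and that a human-verified control set of `(raw_bytes, expected_fields)` pairs exists. The function names and metadata keys are hypothetical.

```python
from typing import Callable, List, Tuple


def layout_confidence(parse: Callable[[bytes], dict],
                      control: List[Tuple[bytes, dict]]) -> float:
    """Share of control records the candidate layout reproduces exactly."""
    if not control:
        return 0.0
    matched = 0
    for raw, expected in control:
        try:
            if parse(raw) == expected:
                matched += 1
        except ValueError:
            pass  # an unparseable control record counts against the layout
    return matched / len(control)


def to_embedding_record(raw: bytes, parse: Callable[[bytes], dict],
                        confidence: float, layout_id: str) -> dict:
    """Attach schema provenance so retrieval can surface the uncertainty."""
    return {
        "fields": parse(raw),
        "metadata": {
            "schema_source": "inferred",
            "layout_id": layout_id,
            "schema_confidence": round(confidence, 3),
        },
    }
```

Exact-match scoring is deliberately strict: a layout that gets every text field right and one numeric field wrong should not earn partial credit.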
Pattern 7: RAG Over the COBOL Itself — The Source Is the Schema
Most teams miss this one. It is also the cheapest to ship and the fastest to repay its cost. The COBOL source — programs, copybooks, JCL, PROCs — is the authoritative documentation for what the data means and how it is laid out. Index it into a vector store. When an engineer or a downstream pipeline asks "what does field ACCT-BAL-CURRENT mean in the customer master file," the retriever returns the relevant copybook and the COBOL procedures that read or populate it.
This is the leverage point for the Pattern 6 scenario. When the copybook is missing, querying the COBOL source recovers the layout from the WRITE statements, the MOVE statements, and the FD definitions scattered across multiple programs. It does not replace the engineer who knew the system in 1998. It surfaces the relevant code in seconds instead of hours of grep across a 40-year-old repository — and that compresses the recovery loop from days to minutes.[7]
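Before embedding, it helps to pull field-level metadata out of each copybook so that a query for a field name lands on the exact definition. The sketch below is a deliberately narrow regex pass, not a COBOL parser. It assumes sequence-number columns are already stripped and each definition sits on one line, and it silently skips anything with OCCURS, REDEFINES, or continuation lines. `extract_fields` is a hypothetical helper.

```python
import re

FIELD_RE = re.compile(
    r"^\s*(?P<level>\d{2})\s+(?P<name>[A-Z0-9][A-Z0-9-]*)"
    r"(?:\s+PIC(?:TURE)?\s+(?:IS\s+)?(?P<pic>\S+?))?"
    r"(?:\s+(?:USAGE\s+(?:IS\s+)?)?"
    r"(?P<usage>COMP-3|COMP|BINARY|PACKED-DECIMAL|DISPLAY))?"
    r"\s*\.\s*$",
    re.IGNORECASE,
)


def extract_fields(copybook_text: str, source_name: str) -> list:
    """Return one metadata dict per simple field or group definition."""
    fields = []
    for line_no, line in enumerate(copybook_text.splitlines(), start=1):
        match = FIELD_RE.match(line)
        if not match:
            continue  # comments and unsupported clauses are skipped
        pic = match["pic"]
        fields.append({
            "name": match["name"].upper(),
            "level": int(match["level"]),
            "pic": pic.upper() if pic else None,
            # Elementary items default to DISPLAY; group items have no usage.
            "usage": (match["usage"] or "DISPLAY").upper() if pic else None,
            "source": source_name,
            "line": line_no,
            "text": line.strip(),
        })
    return fields
```

Each dict can be stored as chunk metadata next to the embedded source text, so a retrieval hit for a field carries its picture clause and usage without another lookup.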
Nightly snapshot reused for high-velocity transactional data — balances and open claims are stale by 09:00
Application event sourcing with no CDC fallback — overnight batch updates are invisible to the pipeline
Federation deployed without MIPS observability — mainframe query load lands as a surprise on the invoice
Schema-on-read without numeric field verification — misread COMP-3 fields produce plausible wrong numbers silently
COBOL source treated as legacy noise — the schema documentation is sitting in the repo, unindexed
Pattern picked per data domain: CDC for transactional, nightly for reference, schema-on-read where the copybook is gone
Event sourcing for application-layer semantics, CDC as the enforcement fallback for the batch window
Federation reserved for ad-hoc and low-frequency reads, replication for the hot retrieval path
Schema-on-read with mandatory sample verification and a confidence metadata field on every embedded record
COBOL source-code RAG indexed as a parallel layer for schema discovery, developer tooling, and catalog Q&A
The Freshness Budget Decides the Pattern. Not the Architect.
Right pattern follows from what the domain requires. Wrong pattern follows from what the team is comfortable building.
The most common failure mode in mainframe-to-RAG projects is picking one pattern and applying it globally. The mainframe estate is not homogeneous. A DB2 database with 200 tables contains domains with radically different freshness requirements. The product master has 300 records updated once a quarter. The account transaction history has 50 million records updated continuously through the business day. Treating those two as the same problem is how the architecture overspends on the cheap domain and underbuilds the expensive one.
Run the freshness budget exercise before any architecture diagram is drawn. For each candidate data domain, answer three questions: what is the business consequence of serving a stale answer, what is the maximum acceptable staleness window, and what does the pattern capable of meeting that window cost to operate?
Reference data — product codes, regulatory tables, org hierarchies, postal mappings — almost always lands in Pattern 1. The operational tax of a real-time pipeline is not justified by the consequence of a one-day lag. Customer balances, open claims, current inventory, and in-flight transactions belong in Pattern 2 or Pattern 4. The consequence of a wrong balance is a customer escalation or, worse, a fraud loss. Audit logs and immutable historical records belong in Pattern 1 or Pattern 3 — they do not change, so freshness is irrelevant and query cost is the only variable. The freshness budget is the leverage point. Everything else is execution.
The freshness budget exercise
- ✓ Every candidate data domain listed with its owning system: DB2 table, IMS segment, VSAM file, sequential dataset
- ✓ Maximum acceptable staleness window stated in concrete business terms: T+24h, T+1h, T+5min, or real-time
- ✓ Each staleness window mapped to the pattern tier capable of meeting it, with the operational cost recorded
- ✓ Domains updated by batch jobs identified; they require Pattern 2 or dual CDC + event coverage, not event sourcing alone
- ✓ Domains with missing or unreliable copybooks flagged; they require Pattern 6 or preliminary COBOL RAG to recover the layout
- ✓ Staleness windows signed off by the business owners before any pipeline ships, not after the first retrieval complaint
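The mapping step of the exercise can be written down as a small rule table, which makes the choices reviewable. The sketch below encodes only the rules stated in this section. The 24-hour cutoff and the labels are simplifications, and `choose_patterns` is a hypothetical helper, not a substitute for the cost conversation.

```python
from datetime import timedelta
from typing import List


def choose_patterns(max_staleness: timedelta, updated_by_batch: bool,
                    copybook_reliable: bool = True) -> List[str]:
    """Suggest candidate patterns for one data domain."""
    plan: List[str] = []
    if not copybook_reliable:
        # Recover the layout first, then extract with explicit uncertainty.
        plan += ["P7 COBOL source RAG", "P6 schema-on-read"]
    if max_staleness >= timedelta(hours=24):
        plan.append("P1 nightly export")
    elif updated_by_batch:
        # Event sourcing alone cannot see batch writes.
        plan.append("P2 CDC")
    else:
        plan.append("P2 CDC or P4 event sourcing")
    return plan


product_codes = choose_patterns(timedelta(days=1), updated_by_batch=True)
balances = choose_patterns(timedelta(minutes=5), updated_by_batch=True)
```

Keeping the rules in code also gives the sign-off step something concrete to review: a domain's row in the inventory and the pattern it resolved to.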
Five Anti-Patterns That End Mainframe Modernization Projects
The Big Bang Replication
Attempting to migrate the entire mainframe estate to a vector store in a single phase. Schema surprises, encoding issues, and reconciliation failures compound across domains simultaneously, producing a project that is permanently 80% done. Migrate one domain at a time, prove the pattern, then expand. The blast radius of a domain-by-domain failure is bounded. The blast radius of a big-bang failure is the entire program.
The Forgotten Reconciliation
Shipping the extraction pipeline without the verification harness that continuously compares the vector store against the source. Without reconciliation, you do not learn the pipeline broke. You learn it from a user asking a question whose answer is obviously wrong. The harness is not optional infrastructure. It is the only mechanism that catches silent failure.
The Hidden Mainframe Cost
Underestimating the MIPS impact of CDC tools, federated queries, or extra logging on the mainframe. Every byte read from the source consumes MIPS. Operations teams account for MIPS at the dollar level. Get capacity-planning approval before the first production query, not after the first invoice cycle.
The Encoding Surprise
Assuming EBCDIC-to-UTF-8 conversion is a solved problem a standard library handles. It is not. EBCDIC has regional variants (EBCDIC-037, EBCDIC-500, EBCDIC-1047), packed decimal fields must be excluded from byte-level conversion, and EBCDIC sort order differs from ASCII in ways that quietly break range queries. Test the conversion against production data before declaring the pipeline production-ready.
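Both surprises are easy to reproduce with Python's built-in `cp037` and `cp500` codecs. The helpers below are illustrative only. They show that the same bytes decode to different characters under different code pages, and that a byte-wise EBCDIC sort orders keys differently from a Unicode sort.

```python
def decode_variants(raw: bytes, codepages=("cp037", "cp500")) -> dict:
    """Decode the same bytes under several EBCDIC code pages."""
    return {cp: raw.decode(cp) for cp in codepages}


def ebcdic_sorted(values, codepage: str = "cp037") -> list:
    """Sort strings the way a byte-wise EBCDIC comparison orders them."""
    return sorted(values, key=lambda s: s.encode(codepage))


# Byte 0x4A is one of the positions where cp037 and cp500 disagree.
variants = decode_variants(bytes([0x4A]))

keys = ["11", "A1", "a1"]
mainframe_order = ebcdic_sorted(keys)  # lowercase, then uppercase, then digits
python_order = sorted(keys)            # digits, then uppercase, then lowercase
```

A range query such as "keys between A and Z" therefore selects different rows depending on which side of the conversion it runs, which is why the sort order needs a test of its own.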
Trust Without Verify
Feeding RAG with mainframe data that has not been checked against a known-good control set. A vector store populated with structurally corrupted records (bad encoding, wrong decimal placement, truncated fields) produces retrieval results that look authoritative and are factually wrong. Verify before you embed. Waiting for the first complaint is too late, because the corrupted embeddings are already in the store.
What the First 90 Days Look Like, in Order
- [01] Inventory the domains and their freshness requirements first
  Work with application and business owners to produce a complete list of candidate domains, source systems (DB2 table, IMS segment, VSAM dataset, COBOL-generated flat file), and freshness budgets. This is the only artifact that makes every downstream pattern choice defensible. Skip it and the architecture is a guess.
- [02] Stand up Pattern 7 before any other pipeline
  Index the COBOL programs, copybooks, and JCL into a vector store. Operationally near-free, immediate developer leverage for schema discovery, and the foundation for recovering schema information in Pattern 6 domains. It also forces the team to build the embedding and retrieval infrastructure on a dataset that does not pressure the vector store yet.
- [03] Run the first Pattern 1 export on two high-value reference domains
  Pick two reference domains with clean T+24h or looser budgets. Run the full pipeline (EBCDIC conversion, COMP-3 unpacking, field validation, embedding, vector store load) and verify against known-good control records before any retrieval query is enabled in production. The first production query is too late to discover the encoding bug.
- [04] Build the verification harness before CDC or event sourcing land
  The reconciliation job that continuously compares vector store records against the mainframe source and alerts on divergence must exist before Pattern 2 or Pattern 4 is introduced. CDC and event sourcing fail silently. The verification harness is the only mechanism that catches those failures. Logging that records 'pipeline succeeded' is not observability; it is an alibi.
The Questions Operators Actually Ask
Can we skip CDC and just snapshot nightly?
For reference data and historical records, yes — often the right call. For transactional data where the business consequence of a stale answer is material (account balances, open claims, current inventory), no. The decision belongs to the domain's freshness budget, not to the team's preference for simplicity. Nightly snapshots over high-velocity data produce a RAG system that confidently answers questions with yesterday's facts. That is worse than no system, because the wrongness is not visible.
How do we handle EBCDIC and packed decimal correctly?
Treat the conversion as two phases, in order. First, identify every COMP-3 (packed decimal) or COMP (binary) field from the copybook and exclude them from the EBCDIC-to-UTF-8 byte-level conversion. Second, unpack those numeric fields with a COBOL-aware parser before the character conversion runs on the remaining bytes. Cobrix (open source, Apache Spark) handles this well for VSAM with known copybooks. For DB2 data, the UNLOAD utility (run through DSNUTILB) can be configured to write delimited ASCII output directly, which sidesteps most of the encoding surface. Test both against known-good control records before going live.
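The two phases come down to slicing the record by layout first and choosing a decoder per field, so character conversion never touches numeric bytes. A minimal sketch follows. The `LAYOUT` tuples are a hypothetical stand-in for what a copybook parser would produce, `cp037` is an assumed code page, and the COMP-3 decoder accepts only the common C, D, and F sign codes.

```python
from decimal import Decimal

# Hypothetical layout: (field name, offset, length, kind, implied decimals)
LAYOUT = [
    ("acct_id", 0, 6, "text", 0),
    ("balance", 6, 4, "comp3", 2),
    ("branch", 10, 2, "comp", 0),
]


def _comp3(raw: bytes, scale: int) -> Decimal:
    """Packed decimal: two digits per byte, sign in the final low nibble."""
    nibbles = [n for byte in raw for n in (byte >> 4, byte & 0x0F)]
    sign = nibbles.pop()
    if sign not in (0x0C, 0x0D, 0x0F) or any(d > 9 for d in nibbles):
        raise ValueError(f"not a valid COMP-3 field: {raw.hex()}")
    value = int("".join(str(d) for d in nibbles))
    return Decimal(-value if sign == 0x0D else value).scaleb(-scale)


def parse_record(raw: bytes, layout=LAYOUT, codepage: str = "cp037") -> dict:
    """Decode one fixed-length record field by field."""
    out = {}
    for name, offset, length, kind, scale in layout:
        chunk = raw[offset:offset + length]
        if len(chunk) != length:
            raise ValueError(f"record too short for field {name}")
        if kind == "text":
            out[name] = chunk.decode(codepage).rstrip()
        elif kind == "comp3":
            out[name] = _comp3(chunk, scale)
        elif kind == "comp":
            # COMP fields are big-endian two's-complement binary.
            out[name] = int.from_bytes(chunk, "big", signed=True)
        else:
            raise ValueError(f"unknown field kind: {kind}")
    return out
```

For anything beyond a spot check, a copybook-driven tool such as Cobrix is the safer route; the point here is only the order of operations.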
Federation or replication?
Federation (Denodo, Trino, Starburst) when query frequency is low, freshness must be absolute (zero replication lag), and MIPS cost per query is acceptable. Replication (CDC or nightly) when query frequency is high, the retrieval SLA is tight, and MIPS cost needs to be bounded. The pattern that ships in production is usually both: replicate the hot path into a modern store, keep federation as a fallback for records that have not been replicated yet or that demand an authoritative live read.
Where does IBM Watsonx Data fit?
Watsonx Data is a Presto/Iceberg lakehouse that pairs with IBM Data Gate to expose federated query access to Db2 for z/OS without a full copy. Reasonable for teams already inside the IBM ecosystem who want federation with better performance than a general-purpose Trino deployment. The MIPS cost still applies — Data Gate queries still execute on the mainframe — but the governance and lineage tooling is more mature than standalone Trino over mainframe sources.
When is the migration actually done?
When the verification harness shows less than 0.1% record divergence between vector store and mainframe source for 30 consecutive days, the retrieval latency SLA holds at p99 under production load, and the business owners have signed off on the freshness windows for every domain. 'Done' is not a technology milestone. It is a trust milestone. The verification harness is how you earn that trust systematically — not by asking stakeholders to take the architecture on faith.
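The divergence part of that exit criterion is simple to automate once the harness emits a daily figure. A sketch, assuming `daily_divergence` is a list of fractions ordered oldest to newest; the latency SLA and the business sign-off still have to be checked separately, and `divergence_criterion_met` is a hypothetical name.

```python
from typing import Sequence


def divergence_criterion_met(daily_divergence: Sequence[float],
                             threshold: float = 0.001,
                             required_days: int = 30) -> bool:
    """True when the most recent `required_days` readings are all below threshold."""
    if len(daily_divergence) < required_days:
        return False
    recent = daily_divergence[-required_days:]
    return all(value < threshold for value in recent)
```

A single bad day resets the clock, which is the point: the criterion measures a sustained run, not an average.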
Pre-Production Readiness for a Mainframe-to-RAG Pipeline
Every candidate data domain inventoried with its source system — DB2, IMS, VSAM, sequential dataset
Freshness budget stated per domain in concrete business terms — T+24h, T+5min, real-time — not in adjectives
MIPS overhead of CDC tooling pre-approved by mainframe capacity planning, in writing
Every copybook identified; missing or drifted ones flagged for schema-on-read or COBOL RAG recovery
EBCDIC variant confirmed per source file — EBCDIC-037, EBCDIC-500, EBCDIC-1047 — never assumed
COMP-3 and COMP fields identified and excluded from byte-level encoding conversion
Batch-window coverage confirmed — domains updated by batch jobs have CDC or reconciliation, not event sourcing alone
Verification harness running before any production retrieval query is enabled, not after
COBOL source indexed into a vector store for schema discovery and developer-tooling Q&A
Schema-on-read domains carry a confidence metadata field on every embedded record
Retrieval results sourced from batch pipelines surface the export timestamp to the end user
Pattern 1 exports validated against known-good control records before any production embedding runs
The hardest part of enterprise AI is not the LLM. It is not the embedding model, the chunk size, or the choice between cosine and dot-product similarity. It is the half-mile between a 1985 DB2 schema and a 2026 vector database — the EBCDIC conversion logic nobody documented, the packed decimal fields that corrupt silently, the batch jobs that update half the records in the system without touching the application event layer, and the copybooks last maintained by someone who retired before the iPhone existed.
Every one of those problems is solvable. None of them is solved by picking a better retrieval algorithm. Plan the bridge before you talk about agents. Build the verification harness before you trust retrieval results. Pick the pattern that matches the domain's freshness budget, not the one that looked simplest in the architecture presentation. The teams that get this right treat the data pipeline as the product. The teams that get it wrong discover, six weeks after go-live, that their RAG system is confidently answering questions about a world that no longer exists.
- [1] Precisely: 9 Mainframe Statistics That May Surprise You. Fortune 500 mainframe adoption data (precisely.com)
- [2] BMC Software: State of the Mainframe in 2025. Mainframe workload and transaction statistics (bmc.com)
- [3] IBM: Data Replication Change Data Capture (CDC) Best Practices. IIDR architecture and configuration (ibm.com)
- [4] Qlik: DB2 Mainframe CDC. Qlik Replicate for DB2 z/OS, IMS, and VSAM data integration (qlik.com)
- [5] Kai Waehner: Mainframe Integration with Data Streaming. IBM Z Event Streams and Apache Kafka patterns (kai-waehner.de)
- [6] IBM Documentation: InfoSphere Data Replication CDC for DB2 z/OS. Technical reference (ibm.com)
- [7] Cobrix: COBOL parser and Mainframe/EBCDIC data source for Apache Spark. Open source tooling (github.com)
- [8] AWS Prescriptive Guidance: Convert and unpack EBCDIC data to ASCII using Python (docs.aws.amazon.com)