Enterprise migrations average 18-24 months. Complex ERP environments run 24-36.[8]
Legacy integration is the failure mode named most often. The pilot works; the connection to real systems does not.[5]
Healthcare and finance case studies put integration tax at 20-30% of total AI implementation cost.[5]
A bounded slice routed to the candidate. Eval gates — not vibes — decide expansion.[6]
Most AI articles assume a blank slate. You do not have one. You have a 30-year-old ERP nobody fully understands, a payroll system on a server nobody is allowed to restart, and a CFO who will believe in AI when production proves it — not before. The 2025 MIT report puts roughly 95% of enterprise generative AI pilots in the failure column,[5] and legacy integration is the most cited blocker. Coexistence is the real state of enterprise AI. Almost nothing is written about it directly.
Vendor content writes for greenfield. Migration content writes for deterministic systems where a failure is an HTTP 500 and rollback is a five-minute exercise. AI breaks that frame. The output looks correct and is subtly wrong. It hallucinates. It drifts. The failure mode is silent quality decay, not server errors. Standard rollback playbooks miss it entirely.
This is the operating manual for the 12-18 months your team will actually live in. Five patterns lifted from microservices migration and adapted for non-deterministic output. Routing rules. Fallback triggers. Kill criteria. The decommission checklist you will wish someone had given you on day one.
Coexistence Is the Default. Replacement Is the Fantasy.
The middle state is harder than either side and outlasts every plan written about it.
Vendor framing treats AI as replacement: adopt the new thing, retire the old one. Enterprise reality refuses. Legacy systems that touch payroll, claims processing, or order management are not sprint targets. They are multi-year programs, and the politics alone add six months on each end of the timeline.
Migration content assumes deterministic correctness. A database migration succeeds or fails. A service rewrite returns the right HTTP code or it does not. AI invalidates that assumption. An AI workflow can return a plausible answer that is wrong, inconsistently formatted, or correct on average and catastrophic on the edge cases nobody tested. Watching error rates will not save you.
The 95% pilot failure number[5] is, in part, a coexistence failure. Teams run proofs of concept in isolation, hit the integration wall when the system has to talk to real production systems, and quietly shelve the project. The teams that ship treat coexistence as a first-class architectural concern. Not the gap between 'pilot' and 'full deployment.' The architecture itself.
Five Patterns Borrowed From Microservices, Rewired for Non-Deterministic Output
The patterns transfer. The rollback signals do not. Silent quality drift will not trigger a 500.
Strangler fig, sidecar, and parallel run all started in the microservices migration playbook. They survive the move to AI with one structural change: the rollback signal is different. In microservices, you roll back when error rates spike or latency exceeds SLO. In AI, the system can ship confidently wrong output with a clean error rate. Rollback triggers on eval scores, not on HTTP 500s.
The table maps each pattern to where it fits, the failure mode it carries inside an AI context, and the signal that actually tells you to revert.
| Pattern | Best for | Failure mode | Rollback signal |
|---|---|---|---|
| Strangler Fig | Bounded features with low blast radius — one API endpoint, a classification call, a summarization step | Picking a feature that looks bounded and is not. AI output passes its contract and breaks downstream logic that nobody documented. | Downstream error rate spike, OR eval score for the strangled feature drops below threshold for three consecutive evaluation windows |
| Sidecar | High-stakes workflows where evidence has to precede exposure — fraud detection, medical triage, credit risk | Measuring agreement instead of impact. Confidence built on metrics that do not track what actually matters to the user. | Divergence rate between AI and legacy on critical decision fields exceeds the pre-agreed tolerance (e.g., >5% mismatch) |
| Parallel Run | Any AI workflow you cannot afford to be wrong about. The default starting point. | Running shadow mode without grading it. Slack notifications fire for a week, then nobody opens them. | AI fails to hit the written graduation criterion inside the agreed window — e.g., eval parity with legacy for 14 consecutive days |
| Dual-Write | Migrations where the data store moves with the application — SQL to vector store, ERP table to event stream | Writing to both stores, reading from one, then discovering months of drift the moment you need the other. | Reconciliation job reports >0.1% divergence on any critical entity type for two consecutive runs |
| Feature Flag with Eval-Based Rollback | Any user-facing AI feature that needs a canary. The standard deployment unit for production AI. | Treating the rollback criterion as a 500 rate. Silent quality decay rides in under the radar until users start complaining. | Eval score drops >=2 points vs. baseline, OR adherence metric falls below the agreed threshold for two consecutive eval windows[6] |
Pattern 1: Strangler Fig — When the Feature Is Actually Bounded
Lowest-risk entry. Falls apart the moment 'bounded' turns out to mean 'undocumented dependencies.'
The strangler fig[1] earns its place as the first pattern because the blast radius is the smallest. Pick one feature — a classification step, a summarization endpoint, an intent detection layer — route its traffic to AI, leave everything else on legacy. The legacy system is not decommissioned. It is gradually starved as more features peel off.
The pattern works when the feature has a clean input/output contract, no hidden coupling to legacy internal state, and a recoverable blast radius if it fails. A summarization feature feeding a human review queue qualifies. A credit decision feature that auto-approves a loan does not, no matter how clean the diagram looks.
Feature selection is the trap. Teams pick features that look bounded and turn out to depend on legacy state nobody mapped. The AI output is technically correct and breaks a downstream validation rule that lived in someone's head and a stored procedure from 2009. The discipline that prevents this: trace every consumer of the legacy feature's output through to production before you start. Any consumer you cannot enumerate is a risk surface.
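Enumerating consumers is a graph walk. The sketch below assumes you can express "who reads whose output" as a simple adjacency map; the system names and the `CONSUMERS` map are hypothetical, and in practice building that map is the hard part.

```python
from collections import deque


def downstream_consumers(graph: dict[str, list[str]], feature: str) -> set[str]:
    """Return every system that directly or transitively consumes `feature`'s output."""
    seen: set[str] = set()
    queue = deque([feature])
    while queue:
        node = queue.popleft()
        for consumer in graph.get(node, []):
            # Skip anything already visited so cycles cannot loop forever.
            if consumer not in seen and consumer != feature:
                seen.add(consumer)
                queue.append(consumer)
    return seen


# Hypothetical dependency map: key -> systems that read its output.
CONSUMERS = {
    "summarize_ticket": ["review_queue", "nightly_export"],
    "nightly_export": ["billing_proc_2009"],
}

found = downstream_consumers(CONSUMERS, "summarize_ticket")
print(sorted(found))
```

The transitive consumers are the ones that bite: here the stored procedure never reads the feature directly, and it still shows up in the result.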
Pattern 2: Sidecar — Watch Without Touching
Build the evidence before you expose anything to users. The measurement plan is the pattern.
The sidecar runs AI on the same input as legacy and never returns AI output to the user. Legacy is the source of truth. AI is an observer accumulating a track record you can audit before any promotion conversation.
This is the right call when the workflow is high-stakes and the transition argument requires weeks or months of evidence. Fraud detection, clinical decision support, credit risk scoring — all sidecar candidates. The business case for moving off legacy almost always depends on demonstrating AI accuracy on real production data before a decision-maker will sign off on the swap.
What you measure in sidecar is the whole pattern: agreement rate with legacy on the same input, the structure of disagreements (systematic versus random), latency distribution, behavior on the edge cases where legacy has its own opinions. A sidecar without an observation protocol is just AI running somewhere nobody checks. Build the dashboards before deploy, not after.
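A minimal sketch of the first two measurements, assuming each observation is a `(legacy_output, ai_output)` pair of dicts. The field name `decision` and the sample data are illustrative, not taken from any real system.

```python
from collections import Counter


def divergence_rate(pairs: list[tuple[dict, dict]], critical_fields: list[str]) -> float:
    """Share of requests where AI and legacy disagree on any critical field."""
    if not pairs:
        return 0.0
    mismatched = sum(
        1
        for legacy, ai in pairs
        if any(legacy.get(f) != ai.get(f) for f in critical_fields)
    )
    return mismatched / len(pairs)


def disagreement_breakdown(pairs: list[tuple[dict, dict]], field: str) -> Counter:
    """Count (legacy_value, ai_value) combinations for one field, mismatches only."""
    return Counter(
        (legacy.get(field), ai.get(field))
        for legacy, ai in pairs
        if legacy.get(field) != ai.get(field)
    )


# Hypothetical sidecar log: legacy output first, AI output second.
pairs = [
    ({"decision": "approve"}, {"decision": "approve"}),
    ({"decision": "deny"}, {"decision": "approve"}),
    ({"decision": "deny"}, {"decision": "approve"}),
    ({"decision": "approve"}, {"decision": "approve"}),
]

rate = divergence_rate(pairs, ["decision"])
breakdown = disagreement_breakdown(pairs, "decision")
```

A breakdown dominated by one combination points to a systematic disagreement; counts spread thinly across many combinations look more like noise.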
Pattern 3: Parallel Run — Both Run, Only One Ships
The default starting point for any workflow you cannot afford to be wrong about.
Shadow mode deployment[3][4] runs both systems on the same input and returns only the legacy output to the user. AI output is logged, scored against the legacy baseline, and tracked toward an explicit graduation criterion. When AI consistently meets or beats the bar, you flip the canary.
The operative word is consistently. A 24-hour window where AI outperforms legacy is not graduation evidence. Fourteen consecutive days of eval parity across every input class — including the edge cases — is graduation evidence. Write the graduation criterion before the parallel run starts. Not when the team is impatient to ship.
The failure mode is predictable. Teams launch parallel mode, watch the scores for a week, and stop reading. The Slack notification fires; nobody checks. Two months later the team declares the parallel run a success and cannot articulate what the data showed. Real parallel run has a named owner and a weekly review ritual. Without those, parallel run is theater.
Parallel run as theater
AI runs in dev or staging only; production traffic never reaches it
Output is logged with no score attached
Results land in a Slack channel the team checks when it occurs to them
Graduation criterion is 'when it feels ready' or 'when the PM asks'
AI runs on 2% of traffic with no plan to expand
Real parallel run
AI runs on live production traffic with the same inputs as legacy
Every output is scored against the legacy baseline by an automated eval job
A dashboard tracks the rolling eval score against a visible graduation threshold
Graduation criterion is written down before deploy: e.g., eval parity for 14 consecutive days, P95 latency under 400ms
A named engineer owns the weekly eval report and has authority to call graduation
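A written graduation criterion can be small enough to fit in one function. This sketch assumes one aggregated eval record per day and treats "parity" as the AI score meeting or beating the legacy score; the `DailyEval` shape is hypothetical, and the 14-day and 400ms values are the example figures used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DailyEval:
    ai_score: float
    legacy_score: float
    p95_latency_ms: float


def day_passes(day: DailyEval, max_p95_ms: float = 400.0) -> bool:
    """A day counts only if AI matches or beats legacy and stays under the latency bar."""
    return day.ai_score >= day.legacy_score and day.p95_latency_ms < max_p95_ms


def has_graduated(history: list[DailyEval], required_days: int = 14) -> bool:
    """True when the most recent `required_days` days all pass, with no gaps."""
    streak = 0
    for day in history:  # history is ordered oldest to newest
        streak = streak + 1 if day_passes(day) else 0
    return streak >= required_days


good = DailyEval(ai_score=0.91, legacy_score=0.90, p95_latency_ms=320.0)
bad = DailyEval(ai_score=0.84, legacy_score=0.90, p95_latency_ms=320.0)
```

One failing day resets the streak to zero. That is the point of "consecutive": a good fortnight followed by a bad day is not graduation evidence.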
Pattern 4: Dual-Write — When the Storage Layer Is Migrating Too
Reconciliation becomes the critical path the moment the data store moves with the logic.
Dual-write is the pattern when you are migrating not just application behavior but the underlying store. AI writes to the new system (vector store, event stream, modern document store) and the legacy system simultaneously. Reads stay on legacy until the new store is validated as complete and consistent.
Change Data Capture[7] is usually the backbone. A CDC stream captures every change from the legacy store and replays it into the new one. The AI system writes to both stores directly; CDC covers the writes that still originate in legacy, so the new store does not miss them.
Asymmetric reading is the trap. Teams set up dual-write correctly and then promote the new store to read primary before reconciliation completes. Months later, a downstream job queries the new store and finds records that were never fully migrated. The rule: do not switch read primary until at least one full reconciliation cycle has passed with zero critical divergences. Write the reconciliation job before you write the dual-write logic. It is easier to define correctness early than to negotiate it at 2am while production data is diverging.
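A reconciliation job can start as a checksum comparison over the fields that matter. The sketch below assumes both stores can be read into dicts keyed by record ID; the record shapes and field names are invented for illustration, and a real job would page through both stores rather than load them whole.

```python
import hashlib
import json


def checksum(record: dict, fields: list[str]) -> str:
    """Stable hash over the critical fields only, independent of key order."""
    payload = json.dumps({f: record.get(f) for f in fields}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reconcile(legacy: dict[str, dict], new: dict[str, dict], fields: list[str]):
    """Return (divergence_rate, divergent_ids) across the union of both stores."""
    all_ids = legacy.keys() | new.keys()
    divergent = sorted(
        rid
        for rid in all_ids
        if rid not in legacy
        or rid not in new
        or checksum(legacy[rid], fields) != checksum(new[rid], fields)
    )
    rate = len(divergent) / len(all_ids) if all_ids else 0.0
    return rate, divergent


# Hypothetical snapshots of the two stores.
legacy_store = {
    "inv-1": {"amount": 100, "status": "paid"},
    "inv-2": {"amount": 250, "status": "open"},
    "inv-3": {"amount": 75, "status": "open"},
}
new_store = {
    "inv-1": {"status": "paid", "amount": 100},
    "inv-2": {"amount": 255, "status": "open"},  # drifted value
}  # inv-3 never arrived

rate, divergent = reconcile(legacy_store, new_store, ["amount", "status"])
```

Missing records count as divergence here, not only mismatched ones. That choice is an assumption, but it matches the failure described above: records that were never fully migrated.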
Pattern 5: Feature Flag with Eval-Based Rollback
Standard canary, with one rule: the rollback signal is an eval score, not an HTTP code.
Feature flag canary works for AI exactly the way it works for traditional services, with one non-negotiable change. Error rate is not the rollback signal.
A well-behaved AI system can produce consistently wrong output with a 0% error rate. The 200s flow, latency stays clean, the model returns plausible nonsense on a class of inputs you did not test adequately. Traditional rollback tripwires never fire.
The eval-based rollback signal is explicit and automatic. Define a minimum eval score before the canary opens. Run an automated check every N hours against the live canary sample. If the score drops below threshold for two consecutive windows, the flag flips back without an on-call ticket.[6] The rollback path itself has to be exercised in production before the canary opens. A rollback path that has never been tested in production is not a rollback path. It is a hypothesis.
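The two-consecutive-windows rule reduces to a small piece of state. This is a sketch under stated assumptions: `EvalRollbackGuard` is a hypothetical name, the 0.85 threshold is a placeholder, and `flag_enabled` stands in for whatever your feature flag system exposes.

```python
class EvalRollbackGuard:
    """Flips a canary flag off after N consecutive eval windows below threshold."""

    def __init__(self, min_score: float, max_failing_windows: int = 2) -> None:
        self.min_score = min_score
        self.max_failing_windows = max_failing_windows
        self.flag_enabled = True
        self._failing = 0

    def record_window(self, score: float) -> bool:
        """Record one eval window and return whether the canary is still enabled."""
        if score < self.min_score:
            self._failing += 1
        else:
            self._failing = 0  # a passing window resets the count
        if self._failing >= self.max_failing_windows:
            # Rollback is one-way: re-enabling is a human decision, not automatic.
            self.flag_enabled = False
        return self.flag_enabled


guard = EvalRollbackGuard(min_score=0.85)
states = [guard.record_window(s) for s in (0.90, 0.80, 0.90, 0.80, 0.70, 0.95)]
```

The flag stays off even when a later window passes. Automatic re-enable would let a flapping model toggle itself in and out of production.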
The Router Is the Load-Bearing Component
Live routing has to know AI confidence, eval recency, and blast radius — not just a flag.
The router is not an A/B split. A production-quality coexistence router takes at least three signals: AI confidence on the specific request, recency of the last passing eval (a stale eval is a risk signal), and the blast radius of the request type. A misclassified document summary and a misclassified fraud decision do not live on the same risk axis. The router has to know the difference and stay conservative on high-stakes traffic even when overall AI quality is high.
Live routing rules that hold under load
- ✓ Route to AI only when confidence clears the per-class threshold. Below threshold falls through to legacy automatically.
- ✓ Check eval freshness before routing. If the last eval is older than 24 hours, treat AI as unvalidated and route to legacy until eval is current.
- ✓ Cap blast radius per decision. Requests with downstream financial, legal, or irreversible consequences get a lower AI traffic share than informational ones.
- ✓ Give on-call a one-command toggle that forces all traffic to legacy without a deploy. The kill switch is documented and tested quarterly, not when the incident hits.
- ✓ Log every routing decision with confidence score, eval timestamp, and chosen path. That log is the audit trail when a stakeholder asks why the request went to AI.
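The rules above compose into a short decision function. Treat this as a sketch, not a reference router: the traffic shares in `AI_SHARE`, the risk class names, and the `bucket` parameter (a per-request value in [0, 1) from your traffic splitter) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_EVAL_AGE = timedelta(hours=24)
# Hypothetical traffic shares: high-stakes requests get far less AI exposure.
AI_SHARE = {"informational": 0.50, "high_stakes": 0.05}


@dataclass(frozen=True)
class RouteDecision:
    """One audit-log entry: the chosen path plus the signals behind it."""
    path: str  # "ai" or "legacy"
    reason: str
    confidence: float
    eval_at: datetime


def route(confidence: float, threshold: float, eval_at: datetime, now: datetime,
          risk_class: str, bucket: float, kill_switch: bool = False) -> RouteDecision:
    def decide(path: str, reason: str) -> RouteDecision:
        return RouteDecision(path, reason, confidence, eval_at)

    if kill_switch:  # on-call override wins over everything
        return decide("legacy", "kill_switch")
    if now - eval_at > MAX_EVAL_AGE:  # stale eval means AI is unvalidated
        return decide("legacy", "stale_eval")
    if confidence < threshold:  # per-class confidence threshold
        return decide("legacy", "low_confidence")
    if bucket >= AI_SHARE[risk_class]:  # blast-radius cap on traffic share
        return decide("legacy", "share_cap")
    return decide("ai", "all_checks_passed")


now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(hours=1)
decision = route(0.92, 0.80, fresh, now, "informational", bucket=0.10)
```

Every branch returns the same record shape, so the legacy fallbacks are logged with the same detail as the AI path. Persisting each `RouteDecision` gives you the audit trail.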
Routing anti-patterns
Routing on user ID. Specific cohorts receive systematically worse AI quality and nobody notices until the data scientist runs the slice.
Routing on time of day. Confidence built on off-peak distribution evaporates the moment edge cases arrive at peak load.
Routing without logging. You cannot diagnose quality drift, attribute errors, or pass an audit without a full decision trail.
Routing with no tested rollback. A path that has never run in production is a liability, not a safeguard.
One global toggle for every AI workflow. Reverting every AI-touched feature at once is too coarse. Per-feature flags, independent rollback, no shared switch.
Data Sync Is Where Optimistic Timelines Die
Most coexistence projects fail on data sync, not model quality. Plan tolerance per data type before you build.
Data sync is where the schedule unravels. The AI system needs data the legacy system owns, in a format the AI can consume, at a cadence the workflow requires. Legacy systems were not designed against any of those requirements.
Change Data Capture is the right tool for real-time sync from legacy databases.[7] A CDC stream captures row-level changes and ships them downstream — into the AI system's store, a Kafka topic, a feature store. The setup is rarely simple. The legacy schema may lack the primary key constraints CDC tools require. Binary logging may not be enabled. The operations team may refuse to enable it on a server nobody is allowed to restart.
Eventual consistency is not approximate consistency. Define acceptable lag per data type before you build. Financial totals may need near-real-time sync. Reference data (product catalog, user profiles) may tolerate 15-minute lag. Audit records may need strict ordering guarantees. Writing the tolerance contract before building the sync layer saves weeks of rework. The reconciliation job is the safety net: a scheduled process that compares record counts and critical field checksums between stores and alerts when divergence breaches tolerance. Build it first.
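The tolerance contract can live in code from day one. In this sketch the 15-minute figure for reference data comes from the text above; the 5-second figure for financial totals is an assumed stand-in for "near-real-time", and the data type names are illustrative.

```python
from datetime import timedelta

# Maximum acceptable sync lag per data type. Values are negotiated, not universal.
LAG_TOLERANCE = {
    "financial_totals": timedelta(seconds=5),  # assumption for "near-real-time"
    "reference_data": timedelta(minutes=15),
}


def lag_breaches(observed: dict[str, timedelta]) -> list[str]:
    """Return the data types whose observed lag exceeds the agreed tolerance."""
    return sorted(
        name for name, lag in observed.items() if lag > LAG_TOLERANCE[name]
    )


# Hypothetical lag readings from the sync layer.
observed = {
    "financial_totals": timedelta(seconds=12),
    "reference_data": timedelta(minutes=3),
}
breaches = lag_breaches(observed)
```

A data type missing from `LAG_TOLERANCE` raises a `KeyError` here. That is deliberate in this sketch: syncing a data type nobody wrote a tolerance for should fail loudly.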
Decommission Is Its Own Decision. Graduation Is Not Enough.
Six months of clean canary does not retire a legacy system. Five criteria do.
Decommission is a separate milestone. A pattern that has been clean in canary for six months can still fail decommission if the audit trail is incomplete, the rollback has never been tested end-to-end, or the regulatory team has not signed off on the AI decision trail.
The decision has five dimensions. Error parity — does AI match or beat legacy across every input class, not just the ones you optimized for? Eval coverage — do your evals cover the full input distribution, including the edge cases legacy was handling silently? Audit trail completeness — can you reconstruct every decision the AI made, with confidence scores, for any time window a regulator might ask about? Rollback tested in prod — not in staging. In production, with real traffic, at least once. Regulatory acceptance — in regulated industries, a formal sign-off process that takes months on its own clock.
Decommission the legacy system the week after every criterion is met. Not the week after graduation.
Decommission Readiness Checklist
AI eval scores match or beat legacy baseline across every input class for at least 30 consecutive days
Eval coverage includes edge cases and low-frequency input types, not just the happy path
Full audit trail of AI decisions is queryable — decision, confidence score, input hash, timestamp — for a minimum of 90 days of production history
Rollback from AI to legacy has been exercised in production at least once, under real traffic, with recovery time documented
Reconciliation job confirms data stores are consistent — divergence below tolerance for 30 consecutive runs
Every downstream consumer of the legacy output has been tested against AI output and confirmed compatible
On-call runbook for the AI system has been written, reviewed, and rehearsed — including what to do when evals degrade at 3am
Regulatory and compliance team has formally reviewed and accepted the AI decision trail and rollback capability
Legacy decommission dependencies are fully mapped — no undocumented consumers remain
Post-decommission monitoring plan is live — elevated eval frequency and alert thresholds for the 90 days following retirement
What This Looks Like in Your First 90 Days
The starting sequence — before you touch canary, before you commit to a timeline.
- [01] Pick the smallest workflow worth migrating, and prove it is actually bounded
  Choose a workflow with a clean input/output contract, no irreversible downstream effects, and a blast radius you can absorb if AI is wrong 20% of the time. Trace every consumer of the current output before you commit. Five or more downstream consumers means pick a different workflow.
- [02] Stand up parallel run before any user-facing change
  Deploy AI in shadow mode against production traffic before the first user sees it. Write the graduation criterion (specific eval score, specific duration, specific edge case coverage) and put a name next to it. The parallel run is not complete until that written criterion is met.
- [03] Wire eval-based rollback before the canary opens
  The rollback mechanism has to be live and tested before the first user request reaches AI. Automate it: if the eval score drops below threshold for two consecutive windows, traffic flips back without a human. Then trigger the rollback manually in production to confirm it works; a production-equivalent environment does not count.
- [04] Define the decommission criteria before kickoff, not when the team wants to ship
  The decommission conversation is hard once the AI system is live and everyone wants to declare victory. Define the five criteria (error parity, eval coverage, audit trail, rollback tested, regulatory acceptance) at project kickoff. Put them in the charter. This also surfaces the regulatory timeline early. In many industries, that timeline drives the actual schedule more than any technical factor.
Coexistence Anti-Patterns That Kill the Migration
The failure modes are repeatable enough to deserve names.
The Big Bang Cutover
Coexistence runs for months, then 100% of traffic flips to AI overnight because the team is tired of maintaining two systems. The edge cases that hid inside partial traffic surface at once, the legacy rollback path has rotted, and the next two weeks are incident response.
The Shadow That Nobody Reads
Shadow mode runs with automated logging and no review ritual. Scores accumulate. Nobody opens the dashboard. After three months, someone asks 'how is AI doing?' and the honest answer is 'we have not actually checked.' Data without decisions is a sunk cost.
The Two Sources of Truth
AI's data store gets promoted to primary while legacy still receives writes from a workflow nobody mapped. Both stores diverge. Downstream queries return different answers depending on which store they hit. Reconciliation becomes a permanent role.
The Untested Fallback
The rollback path lives in the architecture diagram and the feature flag config and has never been exercised in production. When it has to fire — under pressure, mid-incident — it fails or fires incorrectly. A rollback path that has never run in production does not exist.
The Forever-Pilot
AI runs alongside legacy for 18 months because nobody wrote the graduation criterion, nobody has the authority to call the decommission, or the regulatory review never started. The coexistence cost compounds monthly. Define the exit criteria at kickoff or you maintain both systems indefinitely.
Questions That Surface Around Month Three
The conversations every coexistence project has, usually mid-canary.
How long does coexistence usually take?
Enterprise legacy modernization averages 18-24 months end-to-end. Complex ERP environments run 24-36.[8] The AI workflow piece tends to land near 18 months once you fold in parallel run (typically 2-4 months to graduation), organizational acceptance work, and regulatory review where the domain requires it. Budget 18 months as the baseline. Finishing in 12 means you were early — that happens with bounded workflows in non-regulated industries. Planning for 6 and landing at 24 is the more common trajectory.
Can we skip parallel run if our AI passes evals in dev?
No. Dev evals and production traffic are different problems. Dev evals cover the inputs you thought to test. Production traffic surfaces the inputs you did not — and in legacy coexistence, those are the edge cases legacy was handling silently for years without anyone documenting them. Shadow mode against production traffic is the only way to find what you missed. Skipping it is how 'AI passed all tests in staging' becomes 'AI is wrong 15% of the time on inputs we never saw before.'
What is the right rollback criterion for non-deterministic systems?
Rollback fires on evaluation outcomes, not error rates. A working starting point: a minimum eval score (e.g., quality score >= 0.85 against a labeled test set), a minimum agreement rate with legacy on critical decision fields (e.g., >= 95% on fields with downstream financial impact), and a maximum latency threshold (e.g., P95 < 400ms). Evaluate on a fixed cadence, typically every 6 hours; two consecutive windows below threshold trigger the rollback automatically.[6] The exact thresholds are domain-specific. The point is that they are written down and automated before the canary opens.
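Written down, the starting-point criterion might look like the sketch below. The three thresholds are the example values from the answer above, not recommendations, and `WindowMetrics` is a hypothetical shape for one eval window's aggregates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    quality_score: float       # against a labeled test set, 0.0 to 1.0
    critical_agreement: float  # agreement with legacy on critical fields, 0.0 to 1.0
    p95_latency_ms: float


def window_failures(m: WindowMetrics) -> list[str]:
    """Return the names of every threshold this eval window missed."""
    failures = []
    if m.quality_score < 0.85:
        failures.append("quality_score")
    if m.critical_agreement < 0.95:
        failures.append("critical_agreement")
    if m.p95_latency_ms >= 400.0:
        failures.append("p95_latency_ms")
    return failures


healthy = WindowMetrics(quality_score=0.90, critical_agreement=0.97, p95_latency_ms=310.0)
drifting = WindowMetrics(quality_score=0.82, critical_agreement=0.97, p95_latency_ms=410.0)
```

Returning the failed threshold names, not a bare boolean, means the rollback log records why the window failed.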
How do we handle user-visible inconsistency between AI and legacy?
During canary, a user who hits AI one session and legacy the next sees different responses to the same query. That is a real cost and has to be communicated. The practical move: pin the routing decision to the session (every request in a session goes to one system) or the user account for the duration of the canary. Never route the same user to different systems inside the same workflow session — that produces the most visible inconsistency. Some user-facing inconsistency during migration is unavoidable. The goal is to bound it and end it.
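Session pinning is commonly done by hashing the session ID into a stable bucket. The sketch below assumes a string session ID; the `salt` value and function name are hypothetical.

```python
import hashlib


def pinned_path(session_id: str, ai_share: float, salt: str = "canary-v1") -> str:
    """Map a session to 'ai' or 'legacy' deterministically for the whole canary."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "ai" if bucket < ai_share else "legacy"


first = pinned_path("session-42", ai_share=0.10)
second = pinned_path("session-42", ai_share=0.10)
```

The same session always lands in the same bucket, so raising `ai_share` only moves sessions from legacy to AI and never back. Changing the salt reshuffles every assignment, so treat it as fixed for the duration of the canary.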
When do we tell the auditor about the AI?
Earlier than feels comfortable, especially in regulated industries. Regulatory frameworks in financial services, healthcare, and insurance commonly require disclosure of automated decision systems, and the disclosure timeline does not compress to fit your sprint cadence. The practical line: engage compliance and legal when you move from shadow mode to any canary that could affect a regulated decision. 'We were just testing' is not a defensible position once a regulated decision has been rendered by an AI system, even in canary.
The middle state is the real state. Every organization running AI in production right now is managing some version of it: a legacy system that cannot be replaced quickly, an AI system not yet trusted to stand alone, and a transition period longer and more expensive than the original plan. That is not failure. That is the honest shape of enterprise AI adoption.
Plan for 18 months and you will be early. Plan for 6 and you will spend 24 explaining why legacy is still running. The teams that come out the other side treated coexistence as the architecture from day one — built the eval harness before the parallel run, tested the rollback before the canary, and defined decommission criteria before anyone was tempted to skip them.
The non-obvious lesson: the teams most aggressive about decommissioning legacy early ended up running it the longest. The pressure to declare victory drives premature cutover. Premature cutover drives incidents. Incidents drive legacy back online. Teams that set a realistic decommission date and defended it against schedule pressure typically arrived on time.
The middle state is not the gap between pilot and production. It is the production architecture for the next 18 months. Build it that way.
- [1] Kai Waehner: Replacing Legacy Systems One Step at a Time — The Strangler Fig Approach (kai-waehner.de)
- [2] Microsoft Azure Architecture Center: Strangler Fig Pattern (learn.microsoft.com)
- [3] MarkTechPost: Safely Deploying ML Models to Production — Shadow Testing and Canary Strategies (March 2026) (marktechpost.com)
- [4] Basalt: Simulation in Shadow Mode — Evaluating AI Safely and Effectively (getbasalt.ai)
- [5] Fortune: MIT Report — 95% of Generative AI Pilots at Companies Are Failing (August 2025) (fortune.com)
- [6] Duckweave: Canary Calm, Rollback Fast — 12 ML Deployment Patterns (February 2026) (medium.com)
- [7] Confluent: What Is Change Data Capture (CDC)? (confluent.io)
- [8] ShiftAsia: Legacy System Migration Strategies — The Complete Guide to Execution Patterns (shiftasia.com)