Most AI articles are written as though you are starting from a blank slate. You are not. You have a 30-year-old ERP that no single engineer fully understands, a payroll system that runs on a server nobody is allowed to restart, and a CFO who is willing to believe in AI when they see it working in production — not before. AI legacy system coexistence is the real state for the majority of enterprise AI projects, and yet almost nothing is written about it specifically.
The vendor-written content assumes you are greenfield. The migration-focused content assumes deterministic systems where a failed deployment triggers a 500 and you roll back in five minutes. AI is different: it can produce output that looks correct and is subtly wrong. It hallucinates. It drifts over time. The failure modes are not server errors — they are silent quality degradation. Standard rollback playbooks do not apply.
This is the operational guide for the 12–18 months most AI teams live in but nobody writes about. Five patterns, borrowed and adapted from microservices migration. Routing rules. Fallback triggers. Kill criteria. And the decommission checklist you will wish you had built on day one.
Why Coexistence Is the Real State, Not the Exception
The middle state is harder than either side and lasts longer than anyone budgets for
Vendor content frames AI as a replacement — you adopt the new thing and the old thing goes away. Enterprise reality is more uncomfortable. Replacing a legacy system that touches payroll, claims processing, or order management is not a sprint. It is a multi-year program, and the politics alone can extend the timeline by six months on either end.
Migration content, meanwhile, is written for deterministic systems where correctness is binary. A database migration either succeeds or it doesn't. A service rewrite either returns the right HTTP codes or it doesn't. AI systems break this assumption entirely. An AI workflow can return a plausible-sounding answer that is factually wrong, inconsistently formatted, or right on average but catastrophically wrong in edge cases you didn't test for. You cannot monitor your way to safety by watching error rates alone.
The MIT finding that roughly 95% of enterprise generative AI pilots are failing[5] is, in part, a coexistence failure. Teams run proofs of concept in isolation, hit the integration wall when they try to connect to real production systems, and quietly shelve the project. The teams that get through are the ones that treat coexistence as a first-class architectural concern — not an afterthought between "pilot" and "full deployment."
Five Coexistence Patterns Borrowed and Adapted
These come from microservices migration — the adaptation is in the failure modes and rollback signals
The strangler fig, sidecar, and parallel run patterns were all established in the microservices migration playbook. They transfer to AI coexistence, but with a critical modification: the rollback signal is different. In microservices, you roll back when error rates spike or latency exceeds SLO. In AI, the system can be silently producing poor output with no error rate signal whatsoever. Rollback must trigger on eval scores, not on HTTP 500s.
The table below maps each pattern, where it fits best, its specific failure mode in an AI context, and what signal actually tells you to revert.
| Pattern | Best for | Failure mode | Rollback signal |
|---|---|---|---|
| Strangler Fig | Well-bounded features with low blast radius — a single API endpoint, a classification task, a document summarization step | Picking a feature that looks bounded but has hidden dependencies on legacy state — the AI output breaks downstream logic it was never tested against | Downstream error rate spike OR eval score for the strangled feature drops below threshold for 3 consecutive evaluation windows |
| Sidecar | High-risk workflows where you need to build trust before exposing AI output to users — fraud detection, medical triage, credit risk | Measuring the wrong things in shadow mode; building false confidence from metrics that don't reflect actual user impact | Divergence rate between AI and legacy output exceeds the pre-agreed tolerance (e.g., >5% mismatch on critical decision fields) |
| Parallel Run | Any AI workflow where you cannot afford to be wrong about quality — start here by default | Running AI in shadow mode but not actually grading it; teams read Slack notifications for a week and then stop checking | AI fails to achieve the pre-defined graduation criterion within the agreed window — e.g., eval parity with legacy for 14 consecutive days |
| Dual-Write | Migrations where the data store itself is changing alongside the application logic — moving from SQL to a vector store, from an ERP table to an event stream | Writing to two systems but only reading from one, then discovering the unread system is months out of sync when you need it | Reconciliation job reports >0.1% divergence between stores on any critical entity type for two consecutive reconciliation runs |
| Feature Flag with Eval-Based Rollback | Any user-facing AI feature that needs a canary before full rollout — the standard deployment unit for production AI | Treating the rollback criterion as a 500-error rate the way you would for a traditional service; missing silent quality degradation until user complaints arrive | Eval score drops ≥2 points vs. baseline, OR adherence metric falls below the agreed threshold for two consecutive evaluation windows[6] |
Pattern 1: The Strangler Fig — When You Can Carve Off a Feature
Incremental, low-risk entry — but only if the feature is actually bounded
The strangler fig[1] is the most frequently cited coexistence pattern for good reason: it is the lowest-risk entry point. You identify one feature — a classification step, a document summarization endpoint, an intent detection layer — and route traffic for that specific feature to the AI system while legacy handles everything else. The legacy system is not decommissioned; it is gradually strangled as more features are carved off and handed to AI.
The pattern works best when the feature has a clean input/output contract, no hidden dependencies on legacy internal state, and a low blast radius if it fails. A document summarization feature that feeds into a human review queue has a manageable blast radius. A credit decision feature that feeds directly into an approval workflow does not — even if it looks bounded on a diagram.
The trap is feature selection. Teams consistently pick features that appear bounded but turn out to have invisible dependencies on legacy state. The AI system's output is technically correct but breaks a downstream validation rule that was never documented. The way to avoid this: before starting the strangler, trace every consumer of the legacy feature's output through to production. Any consumer you cannot enumerate is a risk.
Pattern 2: The Sidecar — Watch Without Touching
Build trust before you expose AI output to users; measure everything
The sidecar runs AI alongside the legacy workflow on the same input, but AI output is never returned to the user. Legacy is the source of truth. AI is an observer, building a track record you can inspect before you promote it.
This pattern is the right choice when the workflow is high-stakes and you need weeks or months of evidence before anyone will accept the transition argument. Fraud detection, clinical decision support, credit risk scoring — these are all sidecar candidates. The business case for moving off legacy often requires demonstrating AI accuracy on real production data before any decision-maker will approve the swap.
What to measure in sidecar mode: agreement rate between AI and legacy on the same input, the nature of disagreements (systematic vs. random), latency distribution, and how AI performs on the edge cases legacy handles differently. The measurement plan is the whole point. A sidecar without a structured observation protocol is just running AI somewhere nobody checks. Set up dashboards before you deploy, not after.
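As a concrete illustration, a sidecar observation job might score agreement on critical decision fields like this. The record shape, field names, and the 5% tolerance are assumptions for the sketch, not a prescribed schema:

```python
# Sketch: scoring sidecar agreement between legacy and AI outputs on the
# same inputs. CRITICAL_FIELDS and the 5% tolerance are illustrative.
CRITICAL_FIELDS = ["decision", "risk_tier"]  # fields with downstream impact

def divergence_report(pairs, tolerance=0.05):
    """pairs: list of (legacy_output, ai_output) dicts for the same input."""
    total = len(pairs)
    mismatches = {f: 0 for f in CRITICAL_FIELDS}
    for legacy, ai in pairs:
        for field in CRITICAL_FIELDS:
            if legacy.get(field) != ai.get(field):
                mismatches[field] += 1
    return {
        field: {"mismatch_rate": count / total,
                "breach": count / total > tolerance}
        for field, count in mismatches.items()
    }

pairs = [
    ({"decision": "approve", "risk_tier": "low"},
     {"decision": "approve", "risk_tier": "low"}),
    ({"decision": "deny", "risk_tier": "high"},
     {"decision": "approve", "risk_tier": "high"}),  # critical disagreement
]
report = divergence_report(pairs)
```

Whether disagreements are systematic or random matters as much as the rate itself — a report like this is the input to that analysis, not the end of it.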
Pattern 3: Parallel Run — Both Run, Only One Ships
The default starting point for any AI workflow you cannot afford to get wrong
Shadow mode deployment[3][4] is the pattern where both systems receive the same input, both produce outputs, but only the legacy output is returned to the user. AI output is logged, scored against the legacy baseline, and tracked toward a graduation criterion. When AI consistently meets or exceeds the quality bar, you flip the canary.
The key word is consistently. A single 24-hour window where AI outperforms legacy is not graduation evidence. Fourteen consecutive days of eval parity across all input classes, including the edge cases, is graduation evidence. Define the graduation criterion before you start the parallel run, not once you are impatient to ship.
The most common failure mode: teams run parallel mode but stop seriously reviewing the scores after the first week. The Slack notification fires, nobody checks it, and after two months the team declares the parallel run a success without being able to articulate what the data actually showed. Real parallel run has a named owner and a weekly review ritual.
Signs of a fake parallel run:
- AI runs in dev or staging only — production data never reaches it
- AI output is logged but there is no score attached to the log
- Results go to a Slack channel that the team checks when it occurs to them
- The graduation criterion is 'when it feels ready' or 'when the PM asks'
- AI is running but only on 2% of traffic with no plan to expand

Signs of a real parallel run:
- AI runs against live production traffic with the same inputs as legacy
- Every AI output is automatically scored against the legacy baseline by an eval job
- A dashboard tracks the rolling eval score with a visible graduation threshold
- The graduation criterion is written down before deployment: e.g., eval parity for 14 consecutive days, P95 latency under 400ms
- There is a named engineer who reviews the weekly eval report and can make the graduation call
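A written graduation criterion can be enforced in code rather than by memory. A minimal sketch of the check, assuming daily paired eval scores and the 14-consecutive-day parity example used above (the window length and margin are whatever your team wrote down):

```python
# Sketch: graduation check for a parallel run. The 14-day window and zero
# parity margin are illustrative — substitute your written criterion.
def graduated(daily_scores, required_days=14, parity_margin=0.0):
    """daily_scores: list of (ai_score, legacy_score) pairs, newest last.
    Graduation requires AI to meet or exceed legacy for the most recent
    `required_days` consecutive days."""
    if len(daily_scores) < required_days:
        return False
    recent = daily_scores[-required_days:]
    return all(ai >= legacy + parity_margin for ai, legacy in recent)

# One bad day resets the streak — a single good window is not evidence.
scores = [(0.91, 0.90)] * 13 + [(0.89, 0.90)]
assert graduated(scores) is False
```

Running this as part of the weekly review makes "are we there yet?" a query, not an opinion.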
Pattern 4: Dual-Write — When Storage Is Migrating Too
When the data store moves alongside the application logic, reconciliation becomes the critical path
Dual-write is necessary when you are migrating not just the application behavior but the underlying data store. AI writes to both the new system (a vector store, an event stream, a modern document store) and the legacy system simultaneously. Reads still come from legacy until the new store is validated as complete and consistent.
Change Data Capture (CDC)[7] is usually the backbone of this pattern: a CDC stream captures every change from the legacy store and replays it into the new one. The AI system writes to both layers directly while the CDC handles the reverse flow for anything legacy still updates.
The trap is asymmetric reading. Teams set up dual-write correctly, then make the mistake of promoting the new store to primary before reconciliation is complete. They discover months of drift only when a downstream job queries the new store and finds records that were never fully migrated. The rule: do not switch read primary until at least one full reconciliation cycle has passed with zero critical divergences. Write the reconciliation job before you write the dual-write logic — it is easier to define correctness early than to discover what 'correct' means when you are debugging production data mismatches at 2am.
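To make "write the reconciliation job before the dual-write logic" concrete, here is a minimal sketch that compares checksums of critical fields across the two stores. The key column, field names, and 0.1% tolerance are illustrative assumptions; the row-fetching is left to your actual store clients:

```python
# Sketch: reconciliation between the legacy store and the new store.
# Rows are plain dicts here; in practice they come from your store queries.
import hashlib

def row_checksum(row, fields):
    """Stable checksum over the critical fields of one record."""
    payload = "|".join(str(row.get(f, "")) for f in sorted(fields))
    return hashlib.sha256(payload.encode()).hexdigest()

def reconcile(legacy_rows, new_rows, key="id", fields=("amount", "status")):
    legacy = {r[key]: row_checksum(r, fields) for r in legacy_rows}
    new = {r[key]: row_checksum(r, fields) for r in new_rows}
    missing = set(legacy) - set(new)          # never made it to the new store
    mismatched = {k for k in legacy.keys() & new.keys()
                  if legacy[k] != new[k]}     # drifted after the write
    divergence = (len(missing) + len(mismatched)) / max(len(legacy), 1)
    return {"missing": missing, "mismatched": mismatched,
            "divergence": divergence, "breach": divergence > 0.001}

legacy_rows = [{"id": 1, "amount": 100, "status": "paid"},
               {"id": 2, "amount": 250, "status": "open"},
               {"id": 3, "amount": 75, "status": "paid"}]
new_rows = [{"id": 1, "amount": 100, "status": "paid"},
            {"id": 2, "amount": 999, "status": "open"}]  # drifted; id 3 missing
result = reconcile(legacy_rows, new_rows)
```

Checksumming critical fields rather than whole rows keeps the job cheap enough to run on every cycle while still catching the divergences that matter.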
Pattern 5: Feature Flag with Eval-Based Rollback
Standard canary deployment, but the rollback criterion is an eval score drop — not a server error
Feature flag canary deployment for AI works exactly as it does for traditional services — with one non-negotiable modification. The rollback criterion cannot be an error rate alone.
A well-behaved AI system can produce consistently wrong output with a 0% error rate. The HTTP 200s are flowing, latency is fine, and the model is confidently returning plausible-sounding nonsense on a class of inputs you did not test adequately. Traditional rollback tripwires will never fire.
The eval-based rollback signal should be explicit and automatic. Define a minimum eval score before you open the canary. Configure an automated check that runs the eval every N hours against the live canary traffic sample. If the score drops below threshold for two consecutive windows, the flag flips back automatically — no on-call intervention required[6]. The rollback path must be tested in production before the canary opens. A rollback path that has never been exercised is not a rollback path; it is a theory.
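A minimal sketch of the two-consecutive-windows check described above. The threshold, score history, and the commented-out flag call are illustrative assumptions — not any particular feature-flag product's API:

```python
# Sketch: automatic flag flip when the eval score breaches threshold for
# two consecutive windows. Thresholds and window cadence are illustrative.
def should_rollback(window_scores, threshold, consecutive=2):
    """window_scores: eval scores per evaluation window, newest last."""
    if len(window_scores) < consecutive:
        return False
    return all(s < threshold for s in window_scores[-consecutive:])

history = [0.91, 0.88, 0.84, 0.83]  # last two windows below the 0.85 floor
if should_rollback(history, threshold=0.85):
    pass  # flag_client.disable("ai_summarizer")  # hypothetical flag call
```

Requiring two consecutive breaches rather than one filters out noisy eval windows without letting a real degradation run for long.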
Routing and Fallback: How to Decide Live
The router is the load-bearing component — it needs to know AI confidence, eval recency, and blast radius
The router is not a simple A/B split. A production-quality coexistence router needs to incorporate at least three signals: the AI system's confidence score on the specific request, the recency of the last passing eval (stale evals are a risk signal), and the blast radius of this particular request type. A misclassified document summary has a different blast radius than a misclassified fraud decision — the router should know the difference and be more conservative with high-stakes request types even when overall AI quality is high.
Live routing rules that work
- ✓ Route to AI only when model confidence exceeds your pre-defined threshold for this request class — confidence below threshold falls through to legacy automatically
- ✓ Check eval freshness before routing: if the last eval run is older than 24 hours, treat AI as unvalidated and route to legacy until eval is current
- ✓ Cap blast radius per routing decision — requests with downstream financial, legal, or irreversible consequences get a lower AI traffic percentage than informational ones
- ✓ Ensure on-call has a one-command toggle to force all traffic to legacy without a deploy — the kill switch must be documented and tested quarterly
- ✓ Log every routing decision with the confidence score, eval timestamp, and which path was taken — this data is the audit trail if a business stakeholder asks why something was routed to AI
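A router combining the three signals might look like the following sketch. The request classes, thresholds, and traffic caps are invented for illustration; the structure — stale-eval fallthrough, per-class confidence floor, per-class traffic cap — is the point:

```python
# Sketch: coexistence router using confidence, eval recency, and blast
# radius. All class names and numbers below are illustrative assumptions.
import time

EVAL_MAX_AGE_S = 24 * 3600  # stale evals mean AI is unvalidated

CLASS_POLICY = {
    # request class: (minimum confidence, maximum AI traffic fraction)
    "summarize_doc": (0.80, 0.50),   # informational — wider canary
    "fraud_decision": (0.95, 0.05),  # high blast radius — conservative
}

def route(request_class, confidence, last_eval_ts, bucket, now=None):
    """Returns 'ai' or 'legacy'. `bucket` is a uniform hash of the request
    in [0, 1), used to enforce the per-class traffic cap deterministically."""
    now = now if now is not None else time.time()
    if now - last_eval_ts > EVAL_MAX_AGE_S:
        return "legacy"  # eval is stale: treat AI as unvalidated
    # Unknown classes default to a policy that never routes to AI.
    min_conf, max_frac = CLASS_POLICY.get(request_class, (1.01, 0.0))
    if confidence < min_conf or bucket >= max_frac:
        return "legacy"
    return "ai"
```

In production every call to `route` should also emit the log line described above — confidence, eval timestamp, and the path taken.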
Routing anti-patterns
- ✗ Routing on user ID — introduces bias where specific user cohorts systematically receive worse AI quality without anyone noticing
- ✗ Routing on time of day — shifts AI traffic away from peak hours and builds confidence on an unrepresentative distribution; edge cases appear only when you go to full traffic
- ✗ Routing without logging — you cannot diagnose quality degradation, attribute errors, or pass an audit without a full routing decision log
- ✗ Routing with no tested rollback — a rollback path that has never been exercised in production is a liability, not a safeguard
- ✗ Single global toggle for all features — flipping one toggle that reverts every AI-touched workflow simultaneously is too coarse-grained; you need per-feature flags with independent rollback
Data Sync: The Part That Eats Your Timeline
Most coexistence projects fail not because of AI quality, but because data sync is harder than anyone estimated
Data sync is where optimistic timelines collapse. The AI system needs data the legacy system owns, in a format the AI system can consume, updated at a cadence the AI workflow requires. Legacy systems were not designed with any of those requirements in mind.
Change Data Capture is the right tool for real-time sync from legacy databases[7]. A CDC stream captures row-level changes from the source database and streams them downstream — into the AI system's data store, into a Kafka topic, into a feature store. The problem is CDC setup on legacy databases is often non-trivial: the legacy schema may lack the primary key constraints CDC tools require, the database may not have binary logging enabled, and the operations team may not be comfortable enabling it on a system that nobody is allowed to restart.
Eventual consistency is not the same as approximate consistency. Define acceptable lag per data type before you build. Financial totals may require near-real-time sync. Reference data (product catalog, user profiles) may tolerate 15-minute lag. Audit records may require strict ordering guarantees. Writing these tolerance requirements down before building the sync layer saves weeks of rework. The reconciliation job is your safety net: a scheduled process that compares record counts and critical field checksums between legacy and the new store, alerting when divergence exceeds tolerance. Build it first.
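Writing the tolerances down can literally mean a config table the sync monitor checks on every run. A sketch, with illustrative data-type categories and thresholds:

```python
# Sketch: per-data-type lag tolerances, checked by the sync monitor.
# Categories and thresholds are illustrative assumptions for one domain.
LAG_TOLERANCE_S = {
    "financial_totals": 30,      # near-real-time sync required
    "reference_data": 15 * 60,   # product catalog, user profiles
    "audit_records": 5,          # strict ordering, very tight lag
}

def lag_breaches(observed_lag_s):
    """observed_lag_s: {data_type: current replication lag in seconds}.
    Returns only the data types whose lag exceeds written tolerance;
    unknown data types are treated as zero-tolerance."""
    return {dt: lag for dt, lag in observed_lag_s.items()
            if lag > LAG_TOLERANCE_S.get(dt, 0)}

breaches = lag_breaches({"financial_totals": 45, "reference_data": 60})
```

Treating unlisted data types as zero-tolerance forces the team to classify every stream explicitly instead of letting new ones drift unmonitored.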
Decommission Criteria: When Is It Safe to Retire the Legacy?
Running in parallel for six months does not mean it is safe to decommission — these are different decisions
Decommission is its own milestone, separate from graduation. A pattern that has been running successfully in canary for six months can still fail decommission criteria if the audit trail is incomplete, the rollback path has never been tested end-to-end, or the regulatory team has not signed off on the AI system's decision trail.
The decommission decision has five dimensions: error parity (does AI match or exceed legacy across all input classes, not just the ones you optimized for?); eval coverage (do your evals cover the full input distribution, including the edge cases legacy was handling silently?); audit trail completeness (can you reconstruct every decision the AI made, with confidence scores, for any arbitrary time window regulators might ask about?); rollback tested in prod (not in staging — in production, with real traffic, at least once); and regulatory acceptance, which in regulated industries often requires a formal sign-off process that takes months by itself.
Decommission the legacy system the week after all five criteria are met — not the week after graduation.
Decommission Readiness Checklist
AI eval scores have met or exceeded legacy baseline across all input classes for at least 30 consecutive days
Eval coverage includes edge cases and low-frequency input types, not just the happy path
Full audit trail of AI decisions is queryable — decision, confidence score, input hash, timestamp — for a minimum of 90 days of production history
Rollback from AI to legacy has been exercised in production at least once, under real traffic, with recovery time documented
Reconciliation job confirms data stores are consistent — divergence below tolerance threshold for 30 consecutive runs
All downstream consumers of the legacy system's output have been tested against AI output and confirmed compatible
On-call runbook for the AI system has been written, reviewed, and practiced — including what to do when evals degrade in the middle of the night
Regulatory and compliance team has formally reviewed and accepted the AI decision trail and rollback capability
Legacy system decommission dependencies are fully mapped — no undocumented consumers remain
Post-decommission monitoring plan is in place — elevated eval frequency and alert thresholds for the 90 days following legacy retirement
What This Looks Like in Your First 90 Days of Coexistence
A grounded starting sequence — before you touch canary, before you announce timelines
1. Pick the smallest workflow worth migrating — and verify it is actually bounded. Choose a workflow with a clean input/output contract, no irreversible downstream effects, and a blast radius you can tolerate if the AI is wrong 20% of the time. Trace every consumer of the current output before committing. If you find more than five downstream consumers, pick a different workflow.
2. Set up parallel run before any user-facing change. Deploy AI in shadow mode against production traffic before the first user sees it. Define the graduation criterion — specific eval score, specific duration, specific edge case coverage — and write it down. Assign an owner who reviews the eval report weekly. The parallel run is not complete until the written graduation criterion is met.
3. Wire eval-based rollback before anyone opens the canary. The rollback mechanism must be live and tested before the first user request reaches AI. Automate it: if eval score drops below threshold for two consecutive windows, traffic flips back to legacy without human intervention. Then manually trigger the rollback in a staging environment that mirrors production to confirm it works.
4. Define decommission criteria before you start — not when you are ready to ship. The decommission conversation is hard to have once the AI system is in production and everyone wants to declare victory. Define the five criteria (error parity, eval coverage, audit trail, rollback tested, regulatory acceptance) at project kickoff. Put them in the project charter. This also surfaces the regulatory timeline early — which, in many industries, will determine your actual schedule more than any technical factor.
Coexistence Anti-Patterns That Sink the Migration
The failure modes are consistent enough to have names
The Big Bang Cutover
Running in coexistence for months, then switching 100% of traffic to AI overnight because the team is tired of maintaining two systems. The edge cases you didn't encounter in partial traffic appear all at once, the legacy rollback path has been neglected, and you spend the next two weeks in incident response.
The Shadow That Nobody Reads
Running AI in shadow mode with automated logging but no structured review ritual. The scores accumulate in a database. Nobody looks at them. After three months, someone asks 'how is AI doing?' and the answer is 'we haven't actually checked.' The shadow mode produced data but not decisions.
The Two Sources of Truth
Promoting the AI system's data store to primary while legacy still receives writes from other systems — typically from a workflow nobody mapped. Both stores diverge. Downstream systems get different answers depending on which store they query. Reconciliation becomes a full-time job.
The Untested Fallback
The rollback path exists in the architecture diagram and in the feature flag configuration, but has never been exercised in production. When it needs to fire — under pressure, in the middle of an incident — it fails or fires incorrectly. A rollback path that has never been tested in production does not exist.
The Forever-Pilot
AI runs alongside legacy for 18 months because nobody has written the graduation criterion, nobody has the organizational authority to call the decommission, or the regulatory review was never started. The coexistence cost compounds monthly. Define the exit criteria at kickoff or you will be maintaining both systems indefinitely.
Common Questions From Platform Leads
The questions that come up in every coexistence project, usually around month three
How long does coexistence usually take?
Enterprise-scale legacy modernization averages 18-24 months end-to-end, with complex ERP environments running 24-36 months[8]. The AI workflow piece specifically tends to hit the 18-month mark once you factor in parallel run (typically 2-4 months to reach graduation criteria), the organizational acceptance work, and regulatory review if the domain requires it. Budget for 18 months as the baseline. If you finish in 12, you were early — that happens with well-bounded workflows in non-regulated industries. Planning for 6 months and landing at 24 is the more common trajectory.
Can we skip parallel run if our AI passes evals in dev?
No. Dev evals and production traffic are different problems. Dev evals cover the inputs you thought to test. Production traffic surfaces the inputs you did not think to test — and in legacy system coexistence, those are often the edge cases the legacy system was handling silently for years without anyone documenting them. Shadow mode against production traffic is the only way to discover what you missed. Skipping it is how 'AI passed all tests in staging' becomes 'AI is wrong 15% of the time on inputs we never saw before.'
What's the right rollback criterion for non-deterministic systems?
The rollback criterion needs to be based on evaluation outcomes, not error rates. A practical starting point: define a minimum eval score (e.g., quality score ≥ 0.85 against your labeled test set), a minimum agreement rate with the legacy baseline on critical decision fields (e.g., ≥ 95% agreement on fields with downstream financial impact), and a maximum latency threshold (e.g., P95 < 400ms). If any of these are violated for two consecutive automated eval windows — typically every 6 hours — the rollback fires automatically[6]. The exact thresholds depend on your domain; the important thing is that they are defined, documented, and automated before the canary opens.
How do we handle user-visible inconsistency between AI and legacy?
During the canary phase, a user who interacts with AI one session and legacy the next may see different responses to the same query. This is unavoidable and needs to be communicated carefully. The practical approach: either tie the routing decision to the session (all requests in a session go to the same system) or to the user account for the duration of the canary. Never route the same user to different systems within the same workflow session — that creates the most visible inconsistency. Accept that some user-facing inconsistency is unavoidable during migration; the goal is to make it temporary and bounded.
When do we tell the auditor about the AI?
Earlier than feels comfortable, especially in regulated industries. Most regulatory frameworks that touch AI decision-making — financial services, healthcare, insurance — require disclosure of automated decision systems, and that disclosure process has its own timeline that does not compress to fit your sprint schedule. In practice, engage compliance and legal when you move from shadow mode to any canary that could affect a decision with regulatory significance. 'We were just testing' is not a defensible position once a regulated decision has been made by an AI system, even in canary mode.
The middle state is the real state. Every organization running AI in production right now is managing some version of this — a legacy system that cannot be replaced quickly, an AI system that is not yet trusted enough to stand alone, and a transition period that is longer and more expensive than the original project plan suggested. That is not a failure. That is the honest shape of enterprise AI adoption.
Plan for 18 months in coexistence and you will be early. Plan for 6 months and you will spend 24 months explaining why the legacy system is still running. The teams that come out the other side are the ones that treated coexistence as a first-class architectural concern from day one — built the eval harness before the parallel run, tested the rollback path before the canary, and defined the decommission criteria before anyone was tempted to skip them.
- [1] Kai Waehner: Replacing Legacy Systems One Step at a Time — The Strangler Fig Approach (kai-waehner.de)
- [2] Microsoft Azure Architecture Center: Strangler Fig Pattern (learn.microsoft.com)
- [3] MarkTechPost: Safely Deploying ML Models to Production — Shadow Testing and Canary Strategies, March 2026 (marktechpost.com)
- [4] Basalt: Simulation in Shadow Mode — Evaluating AI Safely and Effectively (getbasalt.ai)
- [5] Fortune: MIT Report — 95% of Generative AI Pilots at Companies Are Failing, August 2025 (fortune.com)
- [6] Duckweave: Canary Calm, Rollback Fast — 12 ML Deployment Patterns, February 2026 (medium.com)
- [7] Confluent: What Is Change Data Capture (CDC)? (confluent.io)
- [8] ShiftAsia: Legacy System Migration Strategies — The Complete Guide to Execution Patterns (shiftasia.com)