Most enterprise AI lives between pilot and replacement. Five patterns for the 12-18 months it actually takes — strangler fig, sidecar, parallel run, dual-write, eval-based rollback — with the rollback signals that catch silent quality drift.
Enterprise migrations average 18-24 months. ERP environments routinely push past 36.[8]
Legacy integration is the failure mode named most often. The pilot works; the connection to real systems does not.[5]
Healthcare and finance case studies put integration tax at 20-30% of total AI implementation cost.[5]
A bounded slice routed to the candidate. Eval gates — not vibes — decide expansion.[6]
Five coexistence patterns with exact routing rules and AI-specific failure modes
Eval-based rollback: what triggers to set, at what thresholds, in what cadence
CDC-backed dual-write: how to set it up, where it breaks, and what lag tolerance to set per data class
The router logic most teams skip — confidence gating, blast-radius classification, eval freshness
Decommission criteria that regulators actually accept — and the order in which to satisfy them
A 90-day starting sequence and the five anti-patterns that kill migrations silently
Most AI articles assume a blank slate. You do not have one. You have a 30-year-old ERP nobody fully understands, a payroll system on a server nobody is allowed to restart, and a CFO who will believe in AI when production proves it — not before. The 2025 MIT report puts roughly 95% of enterprise generative AI pilots in the failure column,[5] and legacy integration is the most cited blocker. Coexistence is the real state of enterprise AI. Almost nothing is written about it directly.
Vendor content writes for greenfield. Migration content writes for deterministic systems where a failure is an HTTP 500 and rollback is a five-minute exercise. AI breaks that frame. The output looks correct and is subtly wrong. It hallucinates. It drifts. The failure mode is silent quality decay, not server errors. Standard rollback playbooks miss it entirely.
This is the operating manual for the 12-18 months your team will actually live in. Five patterns lifted from microservices migration and adapted for non-deterministic output. Routing rules. Fallback triggers. Kill criteria. The decommission checklist you will wish someone had given you on day one.
The middle state is harder than either side and outlasts every plan written about it.
Vendor framing treats AI as replacement: adopt the new thing, retire the old one. Enterprise reality refuses. Legacy systems that touch payroll, claims processing, or order management are not sprint targets. They are multi-year programs, and the politics alone add six months on each end of the timeline.
Migration content assumes deterministic correctness. A database migration succeeds or fails. A service rewrite returns the right HTTP code or it does not. AI invalidates that assumption. An AI workflow can return a plausible answer that is wrong, inconsistently formatted, or correct on average and catastrophic on the edge cases nobody tested. Watching error rates will not save you.
The 95% pilot failure number[5] is, in part, a coexistence failure. Teams run proofs of concept in isolation, hit the integration wall when the system has to talk to real production, and quietly shelve the project. The teams that ship treat coexistence as a first-class architectural concern. Not the gap between 'pilot' and 'full deployment.' The architecture itself.
Martin Fowler's 2024 re-analysis of real strangler fig projects names the concrete failure modes that drive this gap:[9] the facade accumulates logic nobody intended, decomposition happens along technical lines instead of business boundaries so every change touches both systems, and temporary dual-writes become permanent load-bearing infrastructure. None of those are AI-specific. All of them compound when the replacement system is non-deterministic.
The signal that triggers rollback is the only thing that changes — but it changes everything.
Strangler fig, sidecar, and parallel run all started in the microservices migration playbook. They survive the move to AI with one structural change: the rollback signal is different. In microservices, you roll back when error rates spike or latency exceeds SLO. In AI, the system can ship confidently wrong output with a clean error rate. Rollback triggers on eval scores, not on HTTP 500s.
The table maps each pattern to where it fits, the failure mode it carries inside an AI context, and the signal that actually tells you to revert.
| Pattern | Best for | AI-specific failure mode | Rollback signal |
|---|---|---|---|
| Strangler Fig | Bounded features with low blast radius — one API endpoint, a classification call, a summarization step | Picking a feature that looks bounded and is not. AI output passes its contract and breaks downstream logic that nobody documented. | Downstream error rate spike, OR eval score for the strangled feature drops below threshold for three consecutive evaluation windows |
| Sidecar | High-stakes workflows where evidence has to precede exposure — fraud detection, medical triage, credit risk | Measuring agreement instead of impact. Confidence built on metrics that do not track what actually matters to the user. | Divergence rate between AI and legacy on critical decision fields exceeds the pre-agreed tolerance (e.g., >5% mismatch) |
| Parallel Run | Any AI workflow you cannot afford to be wrong about. The default starting point. | Running shadow mode without grading it. Slack notifications fire for a week, then nobody opens them. | AI fails to hit the written graduation criterion inside the agreed window — e.g., eval parity with legacy for 14 consecutive days |
| Dual-Write | Migrations where the data store moves with the application — SQL to vector store, ERP table to event stream | Writing to both stores, reading from one, then discovering months of drift the moment you need the other. | Reconciliation job reports >0.1% divergence on any critical entity type for two consecutive runs |
| Feature Flag + Eval Rollback | Any user-facing AI feature that needs a canary. The standard deployment unit for production AI. | Treating the rollback criterion as a 500 rate. Silent quality decay rides in under the radar until users start complaining. | Automated eval score drops below defined threshold for two consecutive 6-hour windows — flag flips to legacy without human intervention |
Low blast radius is the prerequisite. Most features that look bounded are not.
The strangler fig[1] earns its place as the first pattern because the blast radius is the smallest. Pick one feature — a classification step, a summarization endpoint, an intent detection layer — route its traffic to AI, leave everything else on legacy. The legacy system is not decommissioned. It is gradually starved as more features peel off.
The pattern works when the feature has a clean input/output contract, no hidden coupling to legacy internal state, and a recoverable blast radius if it fails. A summarization feature feeding a human review queue qualifies. A credit decision feature that auto-approves a loan does not, no matter how clean the diagram looks.
Feature selection is the trap. Teams pick features that look bounded by inspecting the happy path. The edge cases — the inputs the old system handles silently because someone added a special case twelve years ago — surface in production. A claims classification endpoint that looks like a simple text classifier turns out to write a flag into a shared database table that three downstream jobs read without anyone documenting the dependency. The AI returns the right classification and the flag never gets written. Nobody notices until the billing batch fails at month end.
The practical test: map every downstream consumer of the feature output before committing. If you find more than five, you are not looking at a strangler fig candidate. You are looking at a platform-level dependency that will drag the full system into scope.
Fowler's 2024 analysis of real strangler fig projects found that the façade router consistently accumulates logic — special cases, error handling, format translation — until it becomes a third system nobody planned to maintain.[9] Keep the façade thin: one job, route the request, return the response. Any business logic that wants to live in the router belongs in the AI service.
AI watches production traffic for weeks before it touches a single user decision.
The sidecar runs AI on the same input as legacy and never returns AI output to the user. Legacy is the source of truth. AI is an observer accumulating a track record you can audit before any promotion conversation.
This is the right call when the workflow is high-stakes and the transition argument requires weeks or months of evidence. Fraud detection, clinical decision support, credit risk scoring — all sidecar candidates. The business case for moving off legacy almost always depends on demonstrating AI accuracy on real production data before a decision-maker will sign off on the swap.
What you measure in sidecar is the whole pattern: agreement rate with legacy on the same input, the structure of disagreements (systematic versus random), and the cases where AI diverges and AI turns out to be right. That last category is the most valuable — it is where you build the upgrade argument.
The measurement trap is real. Teams build a sidecar, define 'agreement rate with legacy' as the primary metric, then discover the agreement number climbing toward 99% while AI is systematically wrong on a minority input class that legacy handled via a rule nobody documented. Agreement with legacy is not quality. It is correlation. Build an independent labeled test set — inputs where you know the correct answer — and track AI against that in parallel. A divergence on the labeled set that the agreement metric misses is the failure mode that ends sidecar projects.[11]
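The gap between agreement and quality can be made concrete. The following is a minimal sketch, assuming each sidecar record carries an input class, both outputs, and (for the labeled subset) a known-correct answer. The `SidecarRecord` type, its field names, and the sample data are hypothetical illustrations, not taken from any specific tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class SidecarRecord:
    input_class: str             # hypothetical segment label, e.g. "standard"
    legacy_output: str
    ai_output: str
    label: Optional[str] = None  # known-correct answer; None if unlabeled


def agreement_rate(records):
    """Share of records where the AI output matches the legacy output."""
    return sum(r.ai_output == r.legacy_output for r in records) / len(records)


def labeled_accuracy_by_class(records):
    """AI accuracy against ground-truth labels, computed per input class."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        if r.label is None:
            continue
        totals[r.input_class] += 1
        hits[r.input_class] += r.ai_output == r.label
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Illustrative data: AI matches legacy on the common class and is wrong on a
# small class that legacy handled through an undocumented rule.
records = (
    [SidecarRecord("standard", "approve", "approve", "approve")] * 97
    + [SidecarRecord("rare_rule", "deny", "approve", "deny")] * 3
)
print(agreement_rate(records))             # 0.97
print(labeled_accuracy_by_class(records))  # {'standard': 1.0, 'rare_rule': 0.0}
```

The overall agreement number looks healthy while one class fails completely, which is why the per-class labeled score is the one worth putting on the dashboard.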
Sidecar has a hidden cost: it runs two full inference paths for every request. Size the sidecar for sustained load, not peak, and run the AI path asynchronously, either after the legacy response has shipped or in parallel off the request path. The sidecar path must never increase user-visible latency.
Shadow mode is the default starting point for any AI workflow you cannot afford to be wrong about.
Shadow mode deployment[3][4] runs both systems on the same input and returns only the legacy output to the user. AI output is logged, scored against the legacy baseline, and tracked toward an explicit graduation criterion. When AI consistently meets or beats the bar, you flip the canary.
The operative word is consistently. A 24-hour window where AI outperforms legacy is not graduation evidence. Fourteen consecutive days of eval parity across every input class — including the edge cases — is graduation evidence. Write the graduation criterion before the parallel run starts. Not when the team is impatient to ship.
The failure mode is predictable. Teams launch parallel mode, watch the scores for a week, and stop reading. The Slack notifications fire. The dashboard exists. Nobody opens it. After three months, someone asks 'how is the AI doing?' and the honest answer is 'we have metrics but no one is accountable for them.' Data without a named owner and a review ritual is a sunk cost.
The agree-threshold metric — what percentage of AI outputs match legacy across the same inputs — is a useful starting signal but not the graduation gate. Agree rate can be 97% while AI is systematically wrong on a class of inputs that represents 15% of your user base. The graduation criterion has to include input coverage: run the eval suite across every input class, weight by business impact, and require minimum passing scores within each class, not just overall.[11]
Shadow mode that fails:

- AI runs in dev or staging only; production traffic never reaches it
- Output is logged with no score attached
- Results land in a Slack channel the team checks when it occurs to them
- Graduation criterion is 'when it feels ready' or 'when the PM asks'
- AI runs on 2% of traffic with no plan to expand

Shadow mode that works:

- AI runs on live production traffic with the same inputs as legacy
- Every output is scored against the legacy baseline by an automated eval job
- A dashboard tracks the rolling eval score against a visible graduation threshold
- Graduation criterion is written down before deploy: e.g., eval parity for 14 consecutive days, P95 latency under 400ms
- A named engineer owns the weekly eval report and has authority to call graduation
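A written graduation criterion is easiest to defend when it is executable. The sketch below assumes the eval job emits one score per input class per day. The function names, class names, and thresholds are illustrative assumptions, and only the 14-day window comes from the text above.

```python
def days_at_parity(daily_scores, min_score_by_class):
    """Count the trailing run of days where every input class met its minimum.

    daily_scores: list of {input_class: eval_score}, oldest first.
    A class missing from a day's scores counts as a failure for that day.
    """
    streak = 0
    for day in reversed(daily_scores):
        passed = all(
            day.get(cls, float("-inf")) >= minimum
            for cls, minimum in min_score_by_class.items()
        )
        if not passed:
            break
        streak += 1
    return streak


def ready_to_graduate(daily_scores, min_score_by_class, required_days=14):
    """True only after `required_days` consecutive passing days, ending today."""
    return days_at_parity(daily_scores, min_score_by_class) >= required_days


minimums = {"standard": 0.85, "edge_case": 0.80}  # illustrative thresholds
history = [{"standard": 0.90, "edge_case": 0.70}]
history += [{"standard": 0.90, "edge_case": 0.82}] * 14
print(ready_to_graduate(history, minimums))  # True

history.append({"standard": 0.95, "edge_case": 0.60})
print(ready_to_graduate(history, minimums))  # False
```

A strong overall score cannot mask a failing class here, because the check requires every class to pass on every day of the streak.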
Reconciliation becomes the critical path the moment the data store moves with the logic.
Dual-write is the pattern when you are migrating not just application behavior but the underlying store. AI writes to the new system (vector store, event stream, modern document store) and the legacy system simultaneously. Reads stay on legacy until the new store is validated as complete and consistent.
Change Data Capture[7] is usually the backbone. A CDC stream captures every change from the legacy store and replays it into the new one. The AI system writes to both layers directly; CDC handles the reverse flow for anything legacy still updates.
Asymmetric reading is the trap. Teams set up dual-write correctly and then promote the new store to read primary before reconciliation completes. Months later, a downstream job queries the new store and gets a different answer than the one that went into the report last quarter. Nobody can explain the discrepancy because the transition period is in the past and the reconciliation log was not retained.
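One way to keep read promotion honest is to gate it on reconciliation results. This is a minimal sketch, assuming both stores can be read into key-to-value mappings for one entity type. The in-memory dicts stand in for real queries. The 0.1% tolerance mirrors the table above, while requiring two clean runs before promotion is an assumption of this sketch.

```python
_MISSING = object()  # sentinel so a missing key never compares equal to a stored None


def divergence_rate(legacy_rows, new_rows):
    """Fraction of keys missing from either store or holding different values."""
    keys = set(legacy_rows) | set(new_rows)
    if not keys:
        return 0.0
    mismatched = sum(
        1 for k in keys
        if legacy_rows.get(k, _MISSING) != new_rows.get(k, _MISSING)
    )
    return mismatched / len(keys)


def safe_to_promote_reads(recent_rates, max_divergence=0.001, clean_runs=2):
    """Allow promotion only if the last `clean_runs` reconciliations were in tolerance."""
    recent = recent_rates[-clean_runs:]
    return len(recent) == clean_runs and all(r <= max_divergence for r in recent)


legacy = {i: f"v{i}" for i in range(1000)}
new = dict(legacy)
new[7] = "stale"  # value drift
del new[9]        # row never arrived
print(divergence_rate(legacy, new))             # 0.002
print(safe_to_promote_reads([0.0005, 0.002]))   # False
print(safe_to_promote_reads([0.0005, 0.0]))     # True
```

Retaining the list of per-run rates also gives you the reconciliation log that the paragraph above says teams tend to lose.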
Debezium on PostgreSQL is the most common CDC implementation for legacy SQL sources. In well-tuned setups, end-to-end latency from database write to downstream consumer runs in the range of milliseconds to tens of milliseconds.[10] The number that matters more than latency is replication slot lag. PostgreSQL exposes each slot's position in pg_replication_slots, from which lag in bytes can be computed. Production alert thresholds: warning at 1 GB slot lag, critical at 10 GB.[10] If the slot falls behind and never catches up, PostgreSQL keeps retaining WAL for it until the source database's disk fills, or, if max_slot_wal_keep_size is set, invalidates the slot so the downstream copy has to be rebuilt from a fresh snapshot.
Define acceptable lag per data class before you go live. Customer-facing transactional data: sub-second. Reference data that changes weekly: minutes acceptable. Historical audit records: hours acceptable. Teams that set one global lag tolerance miss the class where the tight constraint actually matters.
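Per-class tolerances are simple to encode, which removes the excuse for a single global number. The sketch below uses the three data classes from the paragraph above. The specific limits (1 second, 5 minutes, 4 hours) and the class keys are illustrative assumptions to be tuned per workload.

```python
from datetime import timedelta

# Illustrative tolerances for the three data classes described above.
MAX_LAG = {
    "transactional": timedelta(seconds=1),
    "reference": timedelta(minutes=5),
    "audit": timedelta(hours=4),
}


def lag_violations(observed_lag):
    """Return the data classes whose observed lag exceeds their tolerance.

    observed_lag: {data_class: timedelta}. Unknown classes are held to the
    strictest tolerance so a new table cannot silently inherit a loose one.
    """
    strictest = min(MAX_LAG.values())
    return sorted(
        cls for cls, lag in observed_lag.items()
        if lag > MAX_LAG.get(cls, strictest)
    )


observed = {
    "transactional": timedelta(seconds=3),
    "reference": timedelta(seconds=3),
    "audit": timedelta(minutes=30),
}
print(lag_violations(observed))  # ['transactional']
```

A single global tolerance of five minutes would have reported this pipeline as healthy while the transactional class was three seconds behind.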
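The CDC setup will hit at least one schema problem. Legacy databases often lack primary keys on some tables, and without a primary key or an explicit replica identity, PostgreSQL logical replication cannot carry updates and deletes for those tables. Log-based capture may also be off: binary logging on MySQL, wal_level = logical on PostgreSQL. Enabling either usually means a restart, which the operations team may refuse on a restricted server. Budget two to four weeks of schema discovery and pre-work before the first Debezium connector goes live.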
Error rate is not the rollback signal. That is the only thing you need to remember about deploying AI to canary.
Feature flag canary works for AI exactly the way it works for traditional services, with one non-negotiable change. Error rate is not the rollback signal.
A well-behaved AI system can produce consistently wrong output with a 0% error rate. The 200s flow, latency stays clean, the model returns plausible nonsense on a class of inputs you did not test adequately. Traditional rollback tripwires never fire.
The eval-based rollback signal is explicit and automatic. Define a minimum eval score before the canary opens. Run an automated check every six hours against the live canary sample. If the score drops below threshold for two consecutive windows, the flag flips back without an on-call ticket.[6] The rollback path itself has to be exercised in production — not staging — before the canary opens. A rollback mechanism that has never run in production does not exist.
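The two-consecutive-window rule can be sketched as a small state machine. The class name and the choice to stay on legacy until a human re-opens the canary are assumptions of this sketch. In a real deployment the flip would call whatever feature-flag service the team already runs.

```python
class EvalRollbackGuard:
    """Flips serving to legacy after N consecutive eval windows below threshold."""

    def __init__(self, threshold, consecutive_windows=2):
        self.threshold = threshold
        self.consecutive_windows = consecutive_windows
        self.failing_streak = 0
        self.serving = "ai"

    def record_window(self, eval_score):
        """Call once per eval window (every six hours in the text above)."""
        if self.serving == "legacy":
            # Assumption: once rolled back, stay there until a human re-opens.
            return self.serving
        if eval_score < self.threshold:
            self.failing_streak += 1
        else:
            self.failing_streak = 0  # a passing window resets the streak
        if self.failing_streak >= self.consecutive_windows:
            self.serving = "legacy"
        return self.serving


guard = EvalRollbackGuard(threshold=0.85)
scores = [0.90, 0.80, 0.88, 0.82, 0.81, 0.95]
print([guard.record_window(s) for s in scores])
# ['ai', 'ai', 'ai', 'ai', 'legacy', 'legacy']
```

A single bad window does not trip the guard, and a later recovery does not silently re-enable the AI path.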
Canary progression follows a standard ramp:[6] start at 1% of traffic for 24 hours as a catastrophic regression smoke test. If no hard blockers surface, expand to 5%, then 10%, then 25%, then 50%. Each step requires the eval score to hold for a minimum soak period. A workflow that sees weekly seasonality needs to soak through at least one full cycle before you declare the canary clean.
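The ramp can be expressed as a gate that advances only when both conditions hold. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the soak length is supplied by the caller, and the ramp stops at 50% because the text above does not specify the steps beyond it.

```python
RAMP = [1, 5, 10, 25, 50]  # percent of traffic, following the ramp above


def next_traffic_share(current, eval_score, threshold, hours_soaked, min_soak_hours):
    """Advance one ramp step only if the eval held for the full soak period.

    Returns the unchanged share when either condition is unmet. Rolling back
    on a failing eval is a separate mechanism and is not handled here.
    """
    if eval_score < threshold or hours_soaked < min_soak_hours:
        return current
    later_steps = [step for step in RAMP if step > current]
    return later_steps[0] if later_steps else current


print(next_traffic_share(1, 0.90, 0.85, hours_soaked=24, min_soak_hours=24))    # 5
print(next_traffic_share(5, 0.90, 0.85, hours_soaked=100, min_soak_hours=168))  # 5
print(next_traffic_share(5, 0.80, 0.85, hours_soaked=200, min_soak_hours=168))  # 5
```

Setting `min_soak_hours` to 168 for a workflow with weekly seasonality forces each step to sit through one full cycle before it can expand.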
The session-pinning constraint is one most teams discover late: if the same user hits the AI canary in one request and legacy in the next within the same session, they see different responses to the same query. Pin the routing decision to the session or user account, not the request. Inconsistency within a session is the most visible failure mode during canary.
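A common way to pin routing is to hash a stable identifier into a bucket instead of rolling a random number per request. The sketch below assumes a session or user ID is available on every request. The function name and the `salt` parameter are hypothetical.

```python
import hashlib


def routes_to_ai(pin_key, canary_percent, salt="canary-v1"):
    """Deterministically map a session or user ID to the AI or legacy path.

    The same pin_key always lands in the same bucket, so a user does not flip
    between systems mid-session. `salt` is a hypothetical per-rollout string.
    """
    digest = hashlib.sha256(f"{salt}:{pin_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


session = "session-8f3a"
assert routes_to_ai(session, 5) == routes_to_ai(session, 5)  # stable across requests
```

Because the bucket is fixed per key, raising the canary from 5% to 25% only adds sessions to the AI path. Nobody already on AI is moved back to legacy by the expansion.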
Most teams implement the router as an A/B split. That is not a coexistence router.
The router is not an A/B split. A production-quality coexistence router takes at least three signals: AI confidence on the specific request, recency of the last passing eval (a stale eval is a risk signal), and the blast radius of the request type. A misclassified document summary and a misclassified fraud decision do not live on the same risk axis. The router has to know the difference and stay conservative on high-stakes traffic even when overall AI quality is high.
A fourth signal, user account tier, matters in B2B contexts. Routing enterprise accounts on contracts to an untested AI path before you have earned trust is a customer risk, not just a technical one.
Do:

- Route to AI only when confidence clears the per-class threshold. Below threshold falls through to legacy automatically.
- Check eval freshness before routing. If the last eval is older than 24 hours, treat AI as unvalidated and route to legacy until eval is current.
- Cap blast radius per decision type. Requests with downstream financial, legal, or irreversible consequences get a lower AI traffic share than informational ones.
- Give on-call a one-command toggle that forces all traffic to legacy without a deploy. Test the kill switch quarterly, not when the incident hits.
- Log every routing decision with confidence score, eval timestamp, and chosen path. That log is the audit trail when a stakeholder asks why the request went to AI.
- Pin routing to the session or user account. Never route the same user to different systems within the same workflow session.

Don't:

- Routing by random percentage only. It ignores confidence and blast radius and treats every request as equal risk.
- Expanding canary traffic without re-checking eval scores first. The eval that cleared 5% may not hold at 25%.
- Using the router to run A/B experiments. Coexistence routing is about safety, not conversion optimization.
- Letting the router accumulate business logic. Every special case in the router makes rollback harder.
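The routing rules above fit in one small function. This is a minimal sketch, not a definitive router: `RouteDecision`, `route`, and all parameter names are hypothetical, and it assumes the caller has already computed the pinned bucket result against the traffic share allowed for that request type.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RouteDecision:
    path: str    # "ai" or "legacy"
    reason: str  # written to the routing log for audit


def route(confidence, min_confidence, last_eval_pass, in_ai_bucket,
          kill_switch=False, now=None, max_eval_age=timedelta(hours=24)):
    """Decide AI vs legacy for one request; any failed check falls back to legacy.

    min_confidence is the per-class threshold. in_ai_bucket is the pinned
    session/user bucket result, already capped by the request type's blast
    radius. last_eval_pass must be a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    if kill_switch:
        return RouteDecision("legacy", "kill switch engaged")
    if now - last_eval_pass > max_eval_age:
        return RouteDecision("legacy", "eval stale")
    if not in_ai_bucket:
        return RouteDecision("legacy", "outside traffic share for this decision type")
    if confidence < min_confidence:
        return RouteDecision("legacy", "confidence below class threshold")
    return RouteDecision("ai", "all gates passed")


now = datetime(2025, 1, 2, tzinfo=timezone.utc)
fresh, stale = now - timedelta(hours=1), now - timedelta(hours=30)
print(route(0.92, 0.80, fresh, True, now=now))  # path='ai'
print(route(0.92, 0.80, stale, True, now=now))  # path='legacy', reason='eval stale'
```

Every branch is a safety check. Nothing in the function inspects the request payload or applies a business rule, which keeps the router thin enough to roll back cleanly.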
The data contract between legacy and AI is never as clean as the architecture diagram suggests.
Data sync is where the schedule unravels. The AI system needs data the legacy system owns, in a format the AI can consume, at a cadence the workflow requires. Legacy systems were not designed against any of those requirements.
Change Data Capture is the right tool for real-time sync from legacy databases.[7] A CDC stream captures row-level changes and ships them downstream: into the AI system's store, a Kafka topic, a feature store. The setup is rarely simple. The legacy schema may lack the primary key constraints CDC tools require. Log-based capture (binary logging on MySQL, logical WAL on PostgreSQL) may not be enabled. The operations team may refuse to enable it on a server nobody is allowed to restart.
Eventual consistency is not approximate consistency. Define acceptable lag per data class before you go live. Customer-facing transaction data: sub-second. Reference data changing weekly: minutes. Historical audit records: hours. Teams that define one global lag tolerance always pick the wrong one for the data class where tight latency actually matters.
Monitor slot lag, not just pipeline health. On PostgreSQL with Debezium, query pg_replication_slots to see how far each slot is behind the current WAL position.[10] Alert at 1 GB of lag; treat 10 GB as a critical incident. A slot that falls too far behind and is never cleaned up can exhaust WAL storage without warning, taking down the source database — not just the pipeline. Set max_slot_wal_keep_size as a safety net and document the emergency slot cleanup procedure before you need it.
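The slot-lag check can be sketched as a query plus a threshold function. `pg_current_wal_lsn()` and `pg_wal_lsn_diff()` are built-in PostgreSQL functions (version 10 and later). Measuring from `restart_lsn` (the WAL the slot forces the server to retain) and treating "GB" as 1024^3 bytes are assumptions of this sketch. The query is shown as a string to run with whatever client the team already uses.

```python
# Retained WAL per slot, in bytes. Run on the source database.
SLOT_LAG_SQL = """
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots;
"""

GIB = 1024 ** 3
WARNING_BYTES = 1 * GIB    # warning threshold from the text
CRITICAL_BYTES = 10 * GIB  # critical threshold from the text


def classify_slot_lag(lag_bytes):
    """Map a slot's lag in bytes to an alert level.

    A NULL lag (None) means the slot has no usable position and needs a
    human to look at it, so it is reported separately as "unknown".
    """
    if lag_bytes is None:
        return "unknown"
    if lag_bytes >= CRITICAL_BYTES:
        return "critical"
    if lag_bytes >= WARNING_BYTES:
        return "warning"
    return "ok"


print(classify_slot_lag(200 * 1024 ** 2))  # ok
print(classify_slot_lag(3 * GIB))          # warning
print(classify_slot_lag(12 * GIB))         # critical
```

Including the `active` column in the query helps separate a slow consumer from a slot whose consumer has disconnected entirely, which is the case most likely to grow without bound.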
The data format gap is a separate problem from the sync problem. Legacy systems output fixed-width records, proprietary binary formats, or normalized SQL rows with decades of implicit schema assumptions. AI systems want JSON with explicit typing, denormalized records, or embeddings. Transformation belongs in the pipeline, not in the AI service. When the legacy schema changes — and it will — you want to update one transformation job, not six downstream AI services.
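A single transformation job of the kind described above might look like the following. The record layout, field names, and status codes are entirely hypothetical and stand in for whatever the legacy system actually emits.

```python
import json

# Hypothetical fixed-width layout: (field name, start, end, converter).
CLAIM_LAYOUT = [
    ("claim_id", 0, 8, str.strip),
    ("amount_cents", 8, 16, int),
    ("status", 16, 17,
     lambda code: {"A": "approved", "D": "denied"}.get(code, "unknown")),
]


def record_to_json(line, layout=CLAIM_LAYOUT):
    """Convert one fixed-width legacy record into explicitly typed JSON."""
    return json.dumps(
        {name: convert(line[start:end]) for name, start, end, convert in layout}
    )


print(record_to_json("CLM0004200012550A"))
# {"claim_id": "CLM00042", "amount_cents": 12550, "status": "approved"}
```

When the legacy layout changes, only `CLAIM_LAYOUT` changes. The AI services downstream keep consuming the same typed JSON.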
A system that passes eval for six months can still fail decommission — and often does.
Decommission is a separate milestone. A pattern that has been clean in canary for six months can still fail decommission if the audit trail is incomplete, the rollback has never been tested end-to-end, or the regulatory team has not signed off on the AI decision trail.
The decision has five dimensions. Error parity — does AI match or beat legacy across every input class, not just the ones you optimized for? Eval coverage — do your evals cover the full input distribution, including the edge cases legacy was handling silently? Audit trail completeness — can you reconstruct every decision the AI made, with confidence scores, for any time window a regulator might ask about? Rollback tested in prod — not in staging. In regulated domains, the regulator may require documented evidence of a live rollback exercise before accepting the decommission application. Regulatory sign-off — in financial services, healthcare, and insurance, an AI system rendering regulated decisions needs formal acceptance from compliance before the legacy system goes offline.
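The five dimensions work best as an explicit gate that anyone can run. A minimal sketch, with hypothetical key names for the five criteria described above:

```python
DECOMMISSION_CRITERIA = [
    "error_parity",
    "eval_coverage",
    "audit_trail_complete",
    "rollback_tested_in_prod",
    "regulatory_signoff",
]


def decommission_blockers(status):
    """Return unmet criteria in checklist order; missing entries count as unmet."""
    return [c for c in DECOMMISSION_CRITERIA if not status.get(c, False)]


status = {
    "error_parity": True,
    "eval_coverage": True,
    "audit_trail_complete": True,
    "rollback_tested_in_prod": False,
}
print(decommission_blockers(status))
# ['rollback_tested_in_prod', 'regulatory_signoff']
```

Treating a missing entry as unmet means a criterion nobody has looked at blocks decommission by default.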
The teams most aggressive about early decommissioning typically ended up running legacy the longest. Pressure to declare victory drives premature cutover. Premature cutover drives incidents. Incidents drive legacy back online. Organizations that shipped a realistic decommission timeline at kickoff — and defended it against schedule pressure — typically arrived on time.
Regulatory engagement should start when the canary opens, not when you want to decommission. In many jurisdictions, formal review of an AI decision system takes two to six months. Starting that conversation on the day someone files a decommission ticket means you are maintaining both systems for another quarter minimum.
A decision heuristic you can apply to your current situation in five minutes.
| Your situation | Start with | Skip if |
|---|---|---|
| Single bounded feature, max 3 downstream consumers, reversible if wrong | Strangler Fig | You cannot name every downstream consumer of the feature output |
| High-stakes workflow (financial, clinical, legal) where you need months of evidence before any user exposure | Sidecar | You don't have a labeled test set independent of legacy — agreement rate alone is not quality evidence |
| Any workflow you cannot afford to be wrong about, or your first AI deployment into production | Parallel Run | You don't have an eval harness built and a named person accountable for reading the scores |
| The data store itself is migrating — moving from SQL to a vector store, event stream, or modern document DB | Dual-Write + CDC | You haven't done schema discovery on the legacy source — missing PKs and disabled binary logging will block the connector |
| Any user-facing AI feature going to canary after passing shadow mode | Feature Flag + Eval Rollback | Your rollback mechanism has never been tested in production — never open a canary without a verified rollback path |
The starting sequence — before you touch canary, before you commit to a timeline.
Choose a workflow with a clean input/output contract, no irreversible downstream effects, and a blast radius you can absorb if AI is wrong 20% of the time. Trace every consumer of the current output before you commit. More than five downstream consumers means pick a different workflow.
Deploy AI in shadow mode against production traffic before the first user sees it. Write the graduation criterion — specific eval score, specific duration, specific edge case coverage — and put a name next to it. The parallel run is not complete until that written criterion is met.
The rollback mechanism has to be live and tested before the first user request reaches AI. Automate it: if the eval score drops below threshold for two consecutive windows, traffic flips back without a human. Then trigger the rollback manually in production, not only in staging, to confirm it works.
The decommission conversation is hard once the AI system is live and everyone wants to declare victory. Define the five criteria — error parity, eval coverage, audit trail, rollback tested, regulatory acceptance — at project kickoff. Put them in the charter. This also surfaces the regulatory timeline early. In many industries, that timeline drives the actual schedule more than any technical factor.
The failure modes repeat across organizations. They have names now.
Coexistence runs for months, then 100% of traffic flips to AI overnight because the team is tired of maintaining two systems. The edge cases that hid inside partial traffic surface at once, the legacy rollback path has rotted, and the next two weeks are incident response.
Shadow mode runs with automated logging and no review ritual. Scores accumulate. Nobody opens the dashboard. After three months, someone asks 'how is AI doing?' and the honest answer is 'we have not actually checked.' Data without decisions is a sunk cost.
AI's data store gets promoted to primary while legacy still receives writes from a workflow nobody mapped. Both stores diverge. Downstream queries return different answers depending on which store they hit. Reconciliation becomes a permanent role.
The rollback path lives in the architecture diagram and the feature flag config and has never been exercised in production. When it has to fire — under pressure, mid-incident — it fails or fires incorrectly. A rollback path that has never run in production does not exist.
AI runs alongside legacy for 18 months because nobody wrote the graduation criterion, nobody has the authority to call the decommission, or the regulatory review never started. The coexistence cost compounds monthly. Define the exit criteria at kickoff or you maintain both systems indefinitely.
The conversations every coexistence project has, usually mid-canary.
How long does coexistence usually take?
Enterprise legacy modernization averages 18-24 months end-to-end. Complex ERP environments run 24-36.[8] The AI workflow piece tends to land near 18 months once you fold in parallel run (typically 2-4 months to graduation), organizational acceptance work, and regulatory review where the domain requires it. Budget 18 months as the baseline. Finishing in 12 means you were early — that happens with bounded workflows in non-regulated industries. Planning for 6 and landing at 24 is the more common trajectory.
Can we skip parallel run if our AI passes evals in dev?
No. Dev evals and production traffic are different problems. Dev evals cover the inputs you thought to test. Production traffic surfaces the inputs you did not — and in legacy coexistence, those are the edge cases legacy was handling silently for years without anyone documenting them. Shadow mode against production traffic is the only way to find what you missed. Skipping it is how 'AI passed all tests in staging' becomes 'AI is wrong 15% of the time on inputs we never saw before.'
What is the right rollback criterion for non-deterministic systems?
Rollback fires on evaluation outcomes, not error rates. A working starting point: a minimum eval score (e.g., quality score >= 0.85 against a labeled test set), a minimum agreement rate with legacy on critical decision fields (e.g., >= 95% on fields with downstream financial impact), and a maximum latency threshold (e.g., P95 < 400ms). Two consecutive eval windows below threshold — typically every 6 hours — flip the rollback automatically.[6] The exact thresholds are domain-specific. The point is that they are written down and automated before the canary opens.
How do we handle user-visible inconsistency between AI and legacy?
During canary, a user who hits AI one session and legacy the next sees different responses to the same query. That is a real cost and has to be managed. The practical move: pin the routing decision to the session (every request in a session goes to one system) or the user account for the duration of the canary. Never route the same user to different systems inside the same workflow session — that produces the most visible inconsistency. Some user-facing inconsistency during migration is unavoidable. The goal is to bound it and end it.
When do we tell the auditor about the AI?
Earlier than feels comfortable, especially in regulated industries. Frameworks that touch AI decision-making — financial services, healthcare, insurance — require disclosure of automated decision systems, and the disclosure timeline does not compress to fit your sprint cadence. The practical line: engage compliance and legal when you move from shadow mode to any canary that could affect a regulated decision. 'We were just testing' is not a defensible position once a regulated decision has been rendered by an AI system, even in canary.
What happens if CDC lag spikes during the dual-write migration?
Stop promoting reads to the new store until lag clears. A lag spike means the new store is behind — reads from it return stale data that may diverge from what the AI system wrote. Monitor pg_replication_slots for slot lag in bytes. At 1 GB: investigate. At 10 GB: treat as critical.[10] The most common cause is a table without a primary key stalling the connector, or a burst write that overwhelmed the consumer. Fix the root cause before advancing the migration. Never promote a store to read primary under active lag.
How do we know if the façade router is accumulating too much logic?
The rule from Fowler's 2024 analysis:[9] if you find yourself adding an if statement to the router for a business reason — format translation, special-case handling, customer tier routing — the logic belongs in the AI service or the upstream caller, not in the router. The router's job is one: look at the request, decide which system handles it, pass it through unchanged. Every business rule in the router is technical debt that survives the migration and becomes permanent infrastructure.
The middle state is the real state. Every organization running AI in production right now is managing some version of it: a legacy system that cannot be replaced quickly, an AI system not yet trusted to stand alone, and a transition period longer and more expensive than the original plan. That is not failure. That is the honest shape of enterprise AI adoption.
Plan for 18 months and you will be early. Plan for 6 and you will spend 24 explaining why legacy is still running. The teams that come out the other side treated coexistence as the architecture from day one — built the eval harness before the parallel run, tested the rollback before the canary, and defined decommission criteria before anyone was tempted to skip them.
The non-obvious lesson: the teams most aggressive about decommissioning legacy early ended up running it the longest. The pressure to declare victory drives premature cutover. Premature cutover drives incidents. Incidents drive legacy back online. Teams that set a realistic decommission date and defended it against schedule pressure typically arrived on time.
The middle state is not the gap between pilot and production. It is the production architecture for the next 18 months. Build it that way.