Train once, control the weights, call it sovereignty. Twelve months later the model is confidently wrong about pricing, policy, and headcount. The playbook for when to retrain, what to retrain on, and how to validate without breaking live agents.
Nadella got the framing right. Most teams that ran with it got the implementation wrong.
At Davos in January 2026 he said it plainly: "If you're not able to embed the tacit knowledge of the firm in a set of weights in a model that you control, by definition you have no sovereignty. That means you're leaking enterprise value to some model somewhere."[1][2] Organizations heard a sprint. Train on internal knowledge. Own the weights. Ship the deck. Done.
Fourteen months later, those teams own a snapshot. The weights they control reflect organizational knowledge from the day they froze the corpus. Since then: three pricing tiers shifted, two product lines died, the support workflow got rebuilt, four policy documents changed. The fine-tuned model speaks confidently, and incorrectly, about all of it. Internal agents surface wrong answers with full authority. That is not sovereignty. It is a well-branded liability that compounds quietly until someone files a ticket.
Firm sovereignty is a continuous operating system. The machinery that makes it real has three problems the train-once playbook ignores: deciding when to retrain without burning compute on noise, deciding what to retrain on without triggering catastrophic forgetting, and validating the new version without breaking the live agents that depend on the current one. This playbook covers those three.
Drift detection thresholds: PSI, composite scoring, and why you need all three trigger classes
Knowledge half-life: four decay classes, what cadence each demands, when RAG beats fine-tuning
Catastrophic forgetting: why it's a default outcome and the corpus construction rules that prevent it
LoRA vs. full fine-tune: the compute decision matrix with real cost numbers
Production validation: tool call compliance gates, shadow deployment, canary with automated rollback
The twelve-month calendar: what Q1–Q4 actually deliver and why Q1 is the only one that matters
Model registry provenance: what metadata every promoted artifact must carry
Below 0.1 the distribution is stable. Between 0.1 and 0.25 you investigate. Above 0.25 you queue a retraining decision. Threshold borrowed from credit risk, now standard for LLM monitoring.[13]
A 2% baseline refusal rate jumping to 15% is a critical drift event. Response length (30%), uncertainty (20%), and vocabulary novelty (10%) round out the score.
Product catalog, policy, org structure, and strategic frameworks decay on different clocks. One corpus, one schedule produces the wrong cadence for all four.
Standard text quality metrics miss this failure mode entirely. It needs an agent-specific validation layer bolted onto the eval harness.
Conflating fast-decay and slow-decay content into one retraining schedule produces the wrong cadence for everything in the corpus.
Teams on a fixed six-month retraining schedule are usually wrong in both directions at once. Over-training on knowledge that barely moves. Under-training on knowledge that changed last month.
Organizational knowledge decays at four distinct rates. Product specs and pricing logic shift in weeks — a new model release, a deprecated API, a quarterly pricing update. Process and policy documentation moves in months. Org structure and people data turns over across quarters. Strategic frameworks change across years, if at all.
A model fine-tuned on a corpus that mixes all four will drift in ways that look noisy on every dashboard. The degradation is not uniform. The model stays accurate on slow-decay knowledge while diverging sharply on fast-decay content. Run the golden set across all categories and the aggregate score looks fine while product catalog answers go dangerously wrong. The aggregate is the problem hiding the problem.
The knowledge half-life framework forces classification before any retraining pipeline gets built. Each domain gets tagged with a decay class. Fast-decay domains trigger frequent runs, or get handed to a retrieval layer over a live source while fine-tuning carries the stable knowledge. Slow-decay domains drop out of frequent cycles entirely — retraining on them just adds noise and regression risk.
The classification also feeds the audit. Before every retraining run, the audit names which domains have changed since the last checkpoint, by how much, and whether the change clears the threshold to justify a run. Skip the audit and you retrain on everything every time. Expensive, slow, and a free pass for catastrophic forgetting.
| Decay class | Examples | Typical half-life | Retraining approach |
|---|---|---|---|
| Fast-decay | Product catalog, pricing tiers, active projects, headcount | 2-8 weeks | RAG over a live source, or a monthly fine-tuning delta |
| Medium-decay | Process docs, policy manuals, compliance, team topology | 3-6 months | Quarterly fine-tune, event-triggered on major policy changes |
| Slow-decay | Strategic frameworks, architectural principles, company values, industry context | 1-3 years | Annual review. Retrain only when substantive revision is confirmed |
| Volatile-event | M&A, product launches, org restructures, regulatory changes | Point-in-time event | Event-triggered run inside 2-4 weeks of the event |
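To see why one fixed schedule misfires across these classes, it can help to read the half-lives in the table as literal exponential half-lives. That reading is an assumption made for illustration (the table gives ranges, not a decay model), and `fraction_still_valid` is a hypothetical helper:

```python
def fraction_still_valid(weeks_since_freeze: float, half_life_weeks: float) -> float:
    """Share of a domain's facts expected to still hold, assuming exponential decay."""
    if half_life_weeks <= 0:
        raise ValueError("half_life_weeks must be positive")
    return 0.5 ** (weeks_since_freeze / half_life_weeks)


SIX_MONTHS_WEEKS = 26
# Upper end of each half-life range from the table, converted to weeks.
half_lives = {"fast": 8, "medium": 26, "slow": 156}

for decay_class, half_life in half_lives.items():
    valid = fraction_still_valid(SIX_MONTHS_WEEKS, half_life)
    print(f"{decay_class:>6}: {valid:.0%} still valid after six months")
```

Under that assumption, a six-month cadence leaves roughly a tenth of fast-decay content still valid by the time the next run fires, while close to nine tenths of slow-decay content has not moved at all. The same schedule is too slow for one class and wasteful for the other.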
The wrong question is 'how often should we retrain?' The right question is 'what signal proves a run is worth the cost and the regression risk?'
Three classes of trigger drive retraining decisions. None is sufficient on its own.
Performance triggers fire when your golden eval set — 50 to 200 representative queries with verified correct answers — drops below threshold. Most rigorous, also lagging. By the time the eval shows degradation, real users have been getting wrong answers for weeks. Pair it with PSI-based input distribution monitoring that catches when the kinds of questions users ask start diverging from what the model was trained to handle.[3] Below 0.1 is stable. Between 0.1 and 0.25 is investigation. Above 0.25 is a retraining review on the queue.[5]
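A minimal sketch of the PSI check, assuming user queries have already been bucketed into topic bins (the bucketing itself is out of scope here). The function names and the sample counts are illustrative; the formula is the standard PSI sum over bins, and the bands are the ones quoted above:

```python
import math
from typing import Sequence


def psi(expected_counts: Sequence[float], actual_counts: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both distributions must use the same bins")
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each proportion so an empty bin does not produce log(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score


def drift_band(score: float) -> str:
    """Map a PSI score to the three bands used in this playbook."""
    if score < 0.1:
        return "stable"
    if score <= 0.25:
        return "investigate"
    return "retraining review"


baseline = [400, 300, 200, 100]   # query counts per topic bucket at training time
this_week = [250, 250, 200, 300]  # same buckets, current traffic
score = psi(baseline, this_week)
print(f"PSI={score:.3f} -> {drift_band(score)}")
```

In the sample data the last bucket triples its share of traffic, which is enough on its own to push the score past 0.25.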
Event triggers fire when something organizational happens that you already know invalidates a slice of the corpus: a major product launch, a pricing change, a policy update, an org restructure. These are the runs the performance system never catches in time, because performance has not had a chance to degrade yet — the model simply has not seen the new world. Event triggers require a real mechanism: the team that owns a fast-decay domain has to be able to tell the platform team their domain just moved, and that signal has to reach the pipeline.
Scheduled triggers run on a fixed cadence regardless of signal. They catch the slow gradient drift that statistical tests miss until it accumulates past detection. A quarterly review that walks the corpus and asks which domains have updated since the last checkpoint is the floor, not the ceiling, for a serious sovereignty program.
The failure mode here is over-triggering, not under-triggering. Sensitive thresholds produce frequent runs on narrow deltas, and frequent runs on narrow deltas produce catastrophic forgetting — new information arrives, previously-mastered knowledge degrades. Require all three trigger classes to independently agree before committing a run. One signal is a hypothesis. Three is a decision.
Catastrophic forgetting is not a training bug. It's a data construction problem. The fix is in the corpus, not the optimizer.
Empirical studies confirm that catastrophic forgetting worsens with model scale — larger models lose more on prior tasks when fine-tuned on narrow new data.[10] The mechanism is straightforward: gradient updates for new tasks overwrite the weight configurations that encoded prior knowledge. Train exclusively on this quarter's policy documents and the model rebuilds those weights toward policy tasks, away from everything else.
Three mitigation strategies exist in production. All three should be combined:
Replay. Include a representative sample of data from prior knowledge domains in every training corpus. Research on continual learning shows that even a 1% rehearsal ratio of historical data meaningfully slows forgetting, though 10-20% is more reliable for complex knowledge domains.[12] For enterprise fine-tuning, a practical replay corpus mixes the new domain delta (60-70% of training tokens) with a stratified sample across all four decay classes (30-40%). The replay sample does not need to be the full original corpus — it needs to cover the semantic distribution of what the model already knows.
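The replay mix described above can be sketched as follows. This is an illustration under stated assumptions: `build_training_mix` is a hypothetical helper, it counts examples rather than tokens, and it splits the replay budget equally across decay classes, which is one simple way to stratify.

```python
import random
from typing import Dict, List


def build_training_mix(
    delta: List[str],
    replay_pool: Dict[str, List[str]],
    replay_share: float = 0.35,
    seed: int = 0,
) -> List[str]:
    """Combine new-domain examples with a stratified replay sample.

    replay_share is the target fraction of the final mix drawn from prior knowledge.
    """
    if not 0 < replay_share < 1:
        raise ValueError("replay_share must be between 0 and 1")
    if not replay_pool:
        raise ValueError("replay_pool needs at least one decay class")
    rng = random.Random(seed)
    # Solve replay / (delta + replay) = replay_share for the replay count.
    replay_total = round(len(delta) * replay_share / (1 - replay_share))
    per_class = max(1, replay_total // len(replay_pool))
    replay: List[str] = []
    for examples in replay_pool.values():
        # Equal allocation per decay class so no class is missing from the replay.
        replay.extend(rng.sample(examples, min(per_class, len(examples))))
    mix = delta + replay
    rng.shuffle(mix)
    return mix


delta = [f"delta-{i}" for i in range(65)]
pool = {c: [f"{c}-{i}" for i in range(50)] for c in ("fast", "medium", "slow", "volatile")}
mix = build_training_mix(delta, pool)
print(len(mix), "examples,", len(mix) - len(delta), "from replay")
```

A production version would likely weight by tokens and sample replay examples by semantic coverage rather than uniformly at random.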
EWC (Elastic Weight Consolidation). After training on a prior task, EWC estimates how important each weight is using the Fisher Information Matrix, then adds a penalty term to the loss function that resists updates to the most important weights. Research on Gemma2 continual pre-training found EWC preserved benchmark performance on Arc, GSM8K, and Belebele while successfully injecting new domain knowledge.[11] The computational overhead is non-trivial — storing Fisher information for a 7B model requires significant memory — but for narrow LoRA adapters it's tractable.
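For reference, the penalty is usually written in the following form, taken from the original EWC formulation rather than from this playbook's sources. Here $\theta^{*}$ are the weights after the prior task, $F_i$ is the diagonal Fisher information for weight $i$, and $\lambda$ is a strength hyperparameter you would have to tune:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{new}}(\theta) \;+\; \frac{\lambda}{2} \sum_i F_i \,\bigl(\theta_i - \theta^{*}_i\bigr)^2
```

Weights with large $F_i$ are expensive to move, so the new-task gradient is pushed toward parameters the prior task did not depend on. The memory cost mentioned above comes from storing one $F_i$ and one $\theta^{*}_i$ per trainable weight.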
Architecture separation. LoRA adapters trained per domain keep new-domain gradients physically separate from the base model weights. This structural separation is the most practical catastrophic forgetting mitigation for teams running continuous fine-tuning at enterprise scale — an adapter for pricing logic doesn't overwrite the adapter encoding compliance knowledge. The tradeoff: inference-time adapter switching adds latency, and per-adapter governance gets complex fast.
A 2-3 point MMLU regression is your warning threshold. A healthy fine-tune holds or improves MMLU on domain-adjacent subjects while general performance stays flat.[15] If MMLU drops more, the replay ratio is too low or the corpus is too narrow.
Narrow-delta corpus (forgetting-prone):
100% new delta data from fast-decay domains only
No coverage of prior-task semantic distribution
Large learning rate, full parameter update
MMLU degrades 3-8 points after each run
Rollback rate climbs past 1-in-4 fine-tuning jobs
Replay-balanced corpus:
60-70% new delta + 30-40% replay across all decay classes
Stratified replay sample covers prior knowledge distribution
Lower learning rate or EWC regularization on critical weights
MMLU stays flat or improves on domain-adjacent subjects
Rollback rate stays below 1-in-10 with stable audit and data QA
The choice isn't philosophical. It's a cost-accuracy-speed tradeoff with concrete numbers attached.
Most continuous sovereignty programs should default to LoRA or QLoRA for domain delta runs. The compute difference is real.
Full fine-tuning a 7B parameter model requires 100-120GB of VRAM, in practice roughly 8×H100s at ~$3/hour each, with runs taking 24-48 hours. Total cost: $500-1,000 per run before data prep.[15] QLoRA on the same 7B model runs on a single H100 for $10-50 per run and finishes in 2-12 hours. The quality gap is smaller than you'd expect: PEFT methods retain 90-95% of full fine-tune quality on most domain-specific tasks.
The case for full fine-tuning narrows to two scenarios. First, when you need behavioral alignment changes that LoRA adapters don't capture — deep RLHF preference shifts, safety alignment rework, fundamental changes to output style across the entire model. Second, when you're making the initial training run from scratch rather than applying a domain delta to an existing base.
For the sovereignty program's steady-state operation — quarterly domain delta runs, event-triggered updates to fast-decay domains — QLoRA with replay is the right default. It's fast enough to run on a monthly cadence for high-velocity domains without breaking the compute budget, and it keeps the catastrophic forgetting risk lower because the adapters touch fewer parameters.
| Factor | QLoRA / LoRA adapter | Full fine-tune |
|---|---|---|
| GPU requirement (7B model) | Single H100 or A100 (40-80GB VRAM) | 8×H100 (100-120GB total VRAM) |
| Compute cost per run | $10-50 (spot instances available) | $500-1,000+ |
| Training time | 2-12 hours | 24-48 hours |
| Quality vs. full fine-tune | 90-95% parity on domain tasks | Baseline |
| Catastrophic forgetting risk | Lower — adapters don't touch base weights | Higher — requires full replay strategy |
| Use when | Domain delta updates, continuous sovereignty cadence | Initial training run, RLHF/safety rework, major behavioral shifts |
| Inference overhead | Negligible (adapters merge at load time) | None |
Retraining on everything every time is expensive and regression-prone. The audit names the minimum effective delta — what changed, by how much, and whether a run on that domain is justified.
The knowledge audit runs before every retraining job. Its output is not a training set. Its output is a decision: retrain on these domains, skip those, here is the minimum corpus to do it without breaking the model.
The audit produces three numbers per domain.
Change magnitude measures how much the domain has actually moved since the last checkpoint. A documentation update that fixes three typos has near-zero magnitude. A policy document that reverses a core business rule has high magnitude. Magnitude comes from diff coverage — percentage of tokens that changed — not from a binary updated-or-not flag.
Gap severity measures how badly the current model performs on queries that touch this domain. If a domain changed but the model still answers it well — possible when the changes do not affect the semantics of common queries — retraining on it is not urgent. Severity is the eval set, filtered to that domain, scored against human-validated reference answers.
Retraining cost is the run itself. LoRA or QLoRA adapters trained on a narrow domain delta are cheap enough that the bar to fire one should be low. Full fine-tuning runs are expensive enough that both high change magnitude and confirmed gap severity are required before anyone commits compute.
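The three numbers can be wired into a decision roughly like this. It is a sketch, not a reference implementation: the threshold values are hypothetical, gap severity is assumed to be `1 - accuracy` on the domain-filtered eval set, and change magnitude is approximated with a token-level diff from the standard library.

```python
import difflib
from dataclasses import dataclass


def change_magnitude(old_text: str, new_text: str) -> float:
    """Approximate share of tokens that changed between two document versions."""
    old_tokens, new_tokens = old_text.split(), new_text.split()
    if not old_tokens and not new_tokens:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    return 1.0 - matcher.ratio()


@dataclass
class DomainAudit:
    domain: str
    magnitude: float     # share of tokens changed since the last checkpoint
    gap_severity: float  # assumed here to be 1 - accuracy on the domain eval slice
    method: str          # "qlora" or "full"


# Hypothetical bars: low for cheap adapter runs, high for full fine-tunes.
BARS = {"qlora": 0.05, "full": 0.20}


def decide(audit: DomainAudit) -> str:
    """Retrain only when both change magnitude and gap severity clear the method's bar."""
    bar = BARS[audit.method]
    justified = audit.magnitude >= bar and audit.gap_severity >= bar
    return "retrain" if justified else "skip"


magnitude = change_magnitude(
    "Pro tier costs 40 dollars per seat",
    "Pro tier costs 55 dollars per seat",
)
print(decide(DomainAudit("pricing", magnitude, gap_severity=0.30, method="qlora")))
print(decide(DomainAudit("pricing", magnitude, gap_severity=0.30, method="full")))
```

The same pricing change justifies a cheap adapter run but not a full fine-tune, which is the asymmetry the cost term is meant to encode.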
MCP context servers are a natural feed for both the audit and the corpus. If your organization already runs MCP servers for inference-time retrieval — documentation, CRM connectors, policy stores — those servers feed the audit too. They are already curated, structured, and maintained by the teams that own the content. Convert MCP-served documents into fine-tuning examples through synthetic Q&A generation and you get a single organizational knowledge layer: one source drives both retrieval at inference and periodic fine-tuning. Two systems that drift apart over time become one system that does not.[8]
Standard eval harnesses miss two failure modes specific to agentic workflows. Both demand dedicated gates before a candidate touches production traffic.
The standard playbook — run the candidate against the golden eval set, check scores, deploy if it passes — is correct for text quality. It misses two failure modes that show up in fine-tuned models running inside agentic workflows.
Tool call schema drift. When a fine-tuned model learns a slightly different output format for tool calls — a renamed key, a changed response structure, a new field — live agents that depend on specific JSON schemas either fail silently or throw parse errors. Text quality metrics do not catch this. The model is still producing high-quality text. It just happens to break the machine-readable structure the agent depends on. The fix is a dedicated tool call compliance check separate from the eval: run the candidate against every tool invocation scenario in the agent's toolkit, parse every JSON it emits, validate against each tool's schema. Compliance must be 100%. A 99% pass rate means roughly one in 100 tool calls fails — and in a multi-step workflow where five sequential tool calls are routine, a 1% per-call failure rate compounds to about 5% per workflow. That is not a passing grade. That is shipping a regression on a schedule.
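A minimal sketch of such a compliance gate, using only the standard library. The tool names and schemas are hypothetical, and the schema format (argument name mapped to a Python type) is a deliberate simplification of a real JSON Schema validator. The last function reproduces the compounding arithmetic, assuming failures are independent across calls:

```python
import json
from typing import Dict, List

# Hypothetical tool schemas: required argument name -> expected Python type.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "lookup_price": {"sku": str, "quantity": int},
    "open_ticket": {"customer_id": str, "summary": str},
}


def is_compliant(raw_output: str) -> bool:
    """True if the output parses as JSON and matches its tool's schema exactly."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    tool, args = call.get("tool"), call.get("arguments")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS or not isinstance(args, dict):
        return False
    schema = TOOL_SCHEMAS[tool]
    # Renamed, missing, or extra keys all count as schema drift.
    if set(args) != set(schema):
        return False
    return all(isinstance(args[name], expected) for name, expected in schema.items())


def compliance_rate(outputs: List[str]) -> float:
    """Share of candidate outputs that pass the schema check; the gate requires 1.0."""
    return sum(is_compliant(o) for o in outputs) / len(outputs)


def workflow_failure_rate(per_call_failure: float, calls_per_workflow: int) -> float:
    """Chance that at least one call in a workflow fails, assuming independent failures."""
    return 1 - (1 - per_call_failure) ** calls_per_workflow


outputs = [
    '{"tool": "lookup_price", "arguments": {"sku": "A-100", "quantity": 2}}',
    '{"tool": "lookup_price", "arguments": {"product_sku": "A-100", "quantity": 2}}',
]
print(compliance_rate(outputs))
print(f"{workflow_failure_rate(0.01, 5):.1%}")
```

The second sample output is well-formed JSON with a renamed key, exactly the kind of drift that a text quality metric would score as fine.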
Behavioral consistency. The model's implicit patterns — default strategies, confidence calibration, escalation tendencies — can shift after retraining in ways that are hard to quantify on a static eval set and immediately visible to users. The agent gets more aggressive about closing tasks without confirmation. Responses get longer for trivial queries. Ambiguous instructions get handled differently. None of this trips the eval. All of it shows up in support tickets the week after promotion.
Shadow deployment is the gate that catches both.[6] The candidate runs alongside production, takes all real traffic in parallel, generates responses, never serves them. The shadow's tool call patterns and response behaviors get diffed against production on every request. When the shadow matches or exceeds production across 95%+ of queries with no regressions on critical cases, proceed to canary. Canary at 1-5% traffic with automated rollback gates layers real user signal on top before full promotion. Two gates. Both required. Neither optional.
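The shadow-to-canary decision can be sketched as a diff over paired records. This is an assumption-heavy simplification: "matches production" is reduced to calling the same tool and making the same refuse-or-answer choice, and the `Pair` record and `shadow_report` function are hypothetical names. A real diff would also compare arguments, response length, and quality scores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pair:
    request_id: str
    prod_tool: Optional[str]    # tool the production model called, if any
    shadow_tool: Optional[str]  # tool the shadow candidate called, if any
    prod_refused: bool
    shadow_refused: bool
    critical: bool = False      # request belongs to a must-not-regress case


def shadow_report(pairs: List[Pair], min_match: float = 0.95) -> dict:
    """Summarize shadow vs. production agreement and decide whether to start canary."""
    if not pairs:
        raise ValueError("no shadow traffic captured")
    matches = [
        p.prod_tool == p.shadow_tool and p.prod_refused == p.shadow_refused
        for p in pairs
    ]
    critical_regressions = [
        p.request_id for p, ok in zip(pairs, matches) if p.critical and not ok
    ]
    match_rate = sum(matches) / len(pairs)
    return {
        "match_rate": match_rate,
        "critical_regressions": critical_regressions,
        "proceed_to_canary": match_rate >= min_match and not critical_regressions,
    }


pairs = [Pair(f"r{i}", "lookup_price", "lookup_price", False, False) for i in range(99)]
pairs.append(Pair("r99", "open_ticket", None, False, True, critical=True))
print(shadow_report(pairs))
```

In the sample, agreement is 99% but the single mismatch is on a critical case, so the candidate does not proceed.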
Golden set eval only:
Golden set eval catches text quality regressions
Misses tool call schema drift. Agents break silently after promotion.
Misses behavioral pattern shifts that surface as user confusion
No exposure to real production traffic distribution before going live
Rollback only after users have already taken the hit
Golden set plus compliance, shadow, and canary gates:
Golden set eval still catches text quality regressions
Tool call compliance gate: 100% schema validity across every tool
Shadow deployment surfaces behavioral shifts against real traffic with zero user impact
Canary at 1-5% layers real user signal before full promotion
Automated rollback gates trip before widespread exposure, not after
Execute the fixed eval set against the candidate and record scores against the production baseline. Separately, run every tool invocation scenario the agent uses and validate the JSON output against each tool's schema. Both must pass. Text quality passing while tool schema compliance fails is not a passing result. It is a deferred outage.
Mirror all production traffic to both models simultaneously. The shadow generates responses but never serves them. Capture tool call patterns, response length distributions, refusal rates, and uncertainty signals. Diff against the production model on the same inputs. This is the only mechanism that exposes behavioral consistency under real workload distribution rather than synthetic eval geometry.
Route a small slice to the candidate. Set rollback gates that trip without a human in the loop: tool call schema error rate above 0.1%, user escalation rate more than 10% relative above production baseline, response quality at P90 below production. Human-monitored gates are a fallback, not a substitute. By the time a human reads the dashboard, users already filed the tickets.
Promote with an explicit version tag in the model registry. Log what triggered the run, which domains were included, the eval delta — gains and losses — and who approved the promotion. The 'knowledge as of' timestamp belongs in model artifact metadata, not in a Jira ticket somebody has to find later.
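The automated rollback gates from the canary step above can be expressed as a pure function over a metrics window. The thresholds are the ones stated in that step; the `CanaryWindow` fields and how they get populated are assumptions about your telemetry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CanaryWindow:
    schema_error_rate: float         # share of canary tool calls failing schema validation
    escalation_rate: float           # share of canary sessions escalated by users
    baseline_escalation_rate: float  # same metric on production over the same window
    quality_p90: float               # P90 response quality score on canary traffic
    baseline_quality_p90: float      # same metric on production


def rollback_reasons(w: CanaryWindow) -> List[str]:
    """Return every tripped gate; an empty list means the canary may continue."""
    reasons = []
    if w.schema_error_rate > 0.001:
        reasons.append("tool call schema error rate above 0.1%")
    if w.escalation_rate > w.baseline_escalation_rate * 1.10:
        reasons.append("escalation rate more than 10% relative above baseline")
    if w.quality_p90 < w.baseline_quality_p90:
        reasons.append("P90 response quality below production")
    return reasons


healthy = CanaryWindow(0.0, 0.050, 0.050, 0.91, 0.90)
degraded = CanaryWindow(0.002, 0.070, 0.050, 0.91, 0.90)
print(rollback_reasons(healthy))
print(rollback_reasons(degraded))
```

Returning the list of reasons rather than a bare boolean means the rollback event can be logged with its cause, which feeds the provenance record for the next run.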
Promote without provenance and you lose the ability to audit why a model knows what it knows — or why it stopped knowing something.
MLflow 3.0 extended the model registry for generative AI: each run is an immutable record connecting the fine-tuned artifact to the training data version, evaluation run, and deployment metadata.[14] That lineage chain is what makes a sovereignty program auditable. Without it, you can tell regulators you own the weights but not what those weights were trained on, when, or by whom.
Every promoted artifact should carry six fields as metadata: the base model version, the fine-tuning method (LoRA adapter vs. full), the training corpus version hash, the corpus date range (this is the knowledge freshness timestamp), the eval delta against the prior production version, and the trigger type that initiated the run. This is not documentation overhead. It is the difference between being able to diagnose a regression and rebuilding from incident tickets.
Version semantics matter too. Use a three-component tag: base-v2.major-corpus-YYYYMMDD.minor-adapter. The base version tells you which foundation model the adapter was built on. The major-corpus date tells you the most recent knowledge this version carries. The minor adapter count tells you how many narrow updates have been stacked since the last major corpus refresh. When a user reports that the model is wrong about something, you can read the version tag and immediately know whether the information they need predates the training corpus.
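One way to hold the six metadata fields and derive the tag from them, sketched as a plain dataclass with no registry dependency. The exact tag layout below (`base-<version>.corpus-<YYYYMMDD>.adapter-<n>`) is one reading of the three-component scheme, and `adapter_count` is an extra field added here only so the tag can be built; both are assumptions.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class PromotionRecord:
    base_model_version: str  # e.g. "v2"
    method: str              # "lora" or "full"
    corpus_hash: str         # version hash of the training corpus
    corpus_start: date       # start of the corpus date range
    corpus_end: date         # end of the range: the "knowledge as of" timestamp
    eval_delta: float        # score change vs. the prior production version
    trigger: str             # "performance", "event", or "scheduled"
    adapter_count: int = 0   # narrow updates stacked since the last corpus refresh

    def version_tag(self) -> str:
        """Build the three-component tag from the record itself."""
        return (
            f"base-{self.base_model_version}"
            f".corpus-{self.corpus_end:%Y%m%d}"
            f".adapter-{self.adapter_count}"
        )


record = PromotionRecord(
    base_model_version="v2",
    method="lora",
    corpus_hash="abc123",
    corpus_start=date(2025, 12, 1),
    corpus_end=date(2026, 3, 1),
    eval_delta=0.04,
    trigger="event",
    adapter_count=3,
)
print(record.version_tag())
print(asdict(record))
```

Deriving the tag from the record, rather than typing it by hand, keeps the tag and the metadata from drifting apart. The `asdict` output is the shape you would attach to the artifact as tags in whatever registry you use.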
Year one of a sovereignty program has distinct phases. The expensive mistake is rushing past instrumentation to the first retraining run.
Q1 is infrastructure. Not training. Teams that fire their first retraining run before instrumentation is in place have no way to know whether the new model is better or worse than the one it replaced. Q1 deliverables: production telemetry instrumented and logging, golden eval set built from real production queries with verified correct answers, knowledge corpus classified by decay rate, baseline drift scores established. That last one matters more than it sounds. You cannot detect drift without a stable baseline to measure against. Start monitoring on the same day you start deploying and every signal looks like drift.
Q2 is the first full cycle. By now telemetry has enough history to show where drift is happening. Run the first knowledge audit, identify the fast-decay domains that have moved most since the initial training, run a targeted fine-tuning job. The first cycle will surface gaps in the validation pipeline — the golden set will miss edge cases shadow deployment catches, canary will reveal behavioral inconsistencies neither caught. These are improvements to the pipeline, not failures of the approach. The first cycle is supposed to be messy. The pipeline learns by being run.
By Q3, the loop runs with minimal manual intervention. Event-triggered jobs fire when fast-decay domains update. Quarterly reviews run the broader audit. The drift dashboard does not require daily inspection. Q3 is also when to run the first governance review: which domains are actually changing, which retraining jobs produced measurable accuracy gains, which domains might be better served by retrieval than continued fine-tuning. Some content changes faster than any retraining cadence can follow. That is a retrieval problem, not a training problem.
Q4 is measurement and course correction. At twelve months, you can compare the model on proprietary knowledge queries to where it stood at initial deployment. Two findings show up consistently. Some domains are drift-resistant enough that retraining frequency can drop with no accuracy cost. Other domains change so fast that fine-tuning will never catch up and a live retrieval layer is the right long-term architecture. Both findings shape Year 2. Sovereignty is not a static target. The right architecture for it shifts as you learn how your organizational knowledge actually ages.
How do we decide between RAG and fine-tuning for a given knowledge domain?
Query structure and update velocity decide. RAG wins when answers need to cite specific source passages, when knowledge changes faster than any retraining cadence can follow, or when compliance requires tracing answers to dated documents. Fine-tuning wins when answers reflect internalized reasoning style rather than retrieved passages, when query latency cannot absorb retrieval, or when knowledge lives more naturally as implicit behavioral patterns than explicit facts. Most production systems run both. Fine-tuning carries behavioral alignment and stable domain knowledge. A retrieval layer carries fast-decay content like current pricing and active projects. Treating them as mutually exclusive is how teams get neither right.
Won't shadow deployment double our inference costs during the validation window?
Yes. Shadow deployment roughly doubles inference cost for the shadow window. The tradeoff: a regression that reaches full traffic costs more in rollback overhead, support escalation, and user trust than a 48-hour shadow run. The teams that argue the cost is too high have not priced what a bad deployment actually costs. Bound the window. 48 hours to one week, never indefinite. If even a 48-hour full-traffic shadow is prohibitive, route 20-30% of traffic to shadow instead of 100%. You lose some signal fidelity. You keep the behavioral consistency check.
How large should the golden eval set be?
Fifty to two hundred questions is the working range for most enterprise deployments. Below fifty you do not have the statistical power to separate genuine regression from sampling noise. A one-question score drop in a 20-question eval means nothing. Above two hundred the marginal signal drops while the cost of maintaining verified answers climbs. Coverage matters more than size: at least 10-15 questions per tracked domain, density weighted toward fast-decay domains. Build the eval from real production queries. Synthetic evals measure the model on your hypotheses about what users ask, not what they actually ask. The two distributions diverge more than most teams expect.
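The sampling-noise point can be checked with the binomial standard error of an accuracy estimate. This is a rough sketch that treats questions as independent and assumes a 90% baseline accuracy for illustration:

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate from n independent questions."""
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)


for n in (20, 50, 200):
    se = accuracy_standard_error(0.90, n)
    print(f"n={n:>3}: one question = {1 / n:.1%}, standard error = {se:.1%}")
```

At 20 questions a single miss moves the score by 5 points, which is smaller than the standard error of the estimate itself, so it cannot be told apart from noise. Going from 50 to 200 questions only halves the standard error, which is the diminishing return described above.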
What happens when a retraining run makes things worse?
More common than teams expect, especially in the first few cycles. The most frequent cause: the corpus included too narrow a slice of recent data without enough stable foundational content. Catastrophic forgetting on every domain not in the delta. The fix is procedural. Every retraining corpus has to include a representative sample across all four decay classes, not just the fast-decay domains that triggered the run. Track rollback rate as a KPI. If more than one in five runs produces a worse model, the audit or the data curation has a systematic problem — not a calibration problem. Investigate the pipeline, not the threshold.
Who should own the sovereignty program — the ML team, the platform team, or the business?
All three, with explicit handoffs. The platform team owns the infrastructure: drift monitoring, fine-tuning pipeline, validation gates, model registry. The ML team owns the training quality: corpus curation, eval design, hyperparameter choices, catastrophic forgetting prevention. The business owns the trigger signals and the promotion approval: domain owners report when their content has changed, a business or legal stakeholder approves promotion. The failure mode is the platform team owning everything and trying to detect domain change through monitoring alone. Drift signals lag organizational change by 4-8 weeks. The fastest, most accurate signal for event triggers is the human who owns the domain, not a statistical test.
How do we prevent one LoRA adapter from breaking another?
Adapter interference is the main risk when running multiple per-domain LoRA adapters against the same base model. Adapters trained independently and served via sequential or merged loading can produce unexpected interference when their fine-tuned behaviors overlap semantically. Two practical approaches. First, train adapters on disjoint domains with low semantic overlap and test them together, not just independently. Second, use adapter merging carefully: linear merging works well for adapters trained on complementary knowledge, while adapters whose updates conflict need more sophisticated merging (task arithmetic, TIES). Monitor per-domain accuracy after any adapter stack change, not just global eval scores.
Sovereignty is not ownership. Ownership is a legal fact about who controls the weights file. Sovereignty is an operational fact about whether those weights reflect the organization as it actually exists today — its pricing, its policies, its structure, its tacit knowledge. That gap between the two definitions is where most fine-tuning programs quietly fail.
The program described here isn't technically novel. PSI monitoring is standard credit risk tooling applied to a new domain. Replay buffers are thirty-year-old ideas from continual learning research. Shadow deployment predates LLMs by a decade. What makes them a sovereignty program is the closed loop: telemetry feeding detection, detection feeding audit, audit feeding corpus construction, corpus construction feeding a validated promotion process, promoted model feeding back into telemetry. Every link in that chain has to close. Leave one open and the snapshot problem reasserts itself six months later.