MIT NANDA's 2025 review of 300+ disclosed AI initiatives found only 5% of pilots translate into operational or financial impact.[1]
McKinsey 2025: 88% of organizations use AI in at least one function. Fewer than 40% have scaled it beyond one workflow.[2]
2026 shadow-AI surveys: 56% of workers report no clear guidance, 43% of companies have no policy at all.[4]
McKinsey 2025: 23% report scaling agents. 39% are still experimenting. Most are earlier than they tell their boards.[2]
Every consulting firm, cloud vendor, and platform company has shipped an AI-native maturity assessment. The structure is identical across all of them: five stages, each one sounds better than the last, and the final stage happens to describe what you become if you buy the vendor's product. The last stage might as well be named after the SKU. It usually is.
This one works differently. It scores your organization across five independent dimensions — data foundation, platform capability, talent and roles, governance, and culture and operating model — and places you honestly on a 0–4 scale for each. The composite score is the least interesting number on the page. The profile is the diagnostic. A company can sit at Stage 3 on platform infrastructure and Stage 1 on governance. That asymmetry is the entire signal. Vendor models flatten it into one number, which is exactly why they fail to predict whether the next initiative ships or stalls.
The stages run from Stage 0 — ChatGPT subscriptions on personal cards, nothing real shipped — through Stage 4, where AI is the default operating mode and hiring is restructured around it. What vendor models skip: Stage 0 exists, regression happens more often than advancement, and many organizations that present as Stage 3 are running theater. This assessment names all of it. If you want a framework that ends with a sales call, there are dozens. This one ends with a checklist you can run in 30 minutes.
Why Vendor Maturity Models Misdiagnose Every Time
The structure is engineered to flatter the buyer and route them toward a purchase. The diagnostic is incidental.
The vendor maturity model problem is structural, not stylistic. Models built to sell consulting hours or platform licenses have to do two things: make the buyer feel behind, and make the path forward require what the vendor sells. The standard five-stage model accomplishes both by design.
First tell: every dimension advances together. Real organizations do not work that way. Platform capability moves through procurement while governance is being drafted by a committee that has not met. Data readiness is blocked on a legal review of the DLP policy. Culture sits at Stage 1 because the CEO announced an AI mandate and performance reviews never got updated. A model that collapses these into one stage score destroys the only diagnostic that would tell you what to fix next. It produces a reassuring number that explains nothing.
Second tell: linearity. Real maturity paths regress. McKinsey's 2025 data shows fewer than one in three companies that started scaling AI maintained or accelerated that progress.[2] The cause is almost always organizational, not technical. The champion leaves. A high-profile failure triggers a moratorium. A new CIO resets priorities. Models that ignore regression are not modeling reality. They are modeling a sales funnel.
Third tell: Stage 0 does not appear. Every vendor model opens at Level 1, framed as awareness or early exploration. Naming Stage 0 honestly (no budget, no shipped work, ChatGPT subscriptions paid on personal cards) matters because it accurately describes roughly 15–20% of mid-market companies as of early 2026. Skipping it is flattery, and flattery is not a diagnostic. You cannot fix a position you will not acknowledge.
Fourth tell: the model was never validated against outcomes. Gartner's framework[8], McKinsey's, and the cloud vendors' all share this property. They were assembled by consultants and product teams with commercial incentives, not derived from organizations that actually shipped or stalled. The stages describe what sophisticated AI adoption looks like from the outside. They do not explain why Stage 2 organizations stay there for three years, or why Stage 3 organizations sometimes build the governance and sometimes only describe it in slides. An honest framework has to answer that.
The Five Stages, Stripped of Aspirational Vocabulary
Each stage carries a distinct operating mode, the lie the org tells itself, and the trap that ends progress for most teams there.
Each stage is a distinct operating mode, not a point on a smooth improvement curve. Most organizations sit closer to Stage 1 or Stage 2 than they admit to boards, investors, and themselves. MIT NANDA's 2025 study of more than 300 enterprise AI initiatives found 95% delivered zero measurable ROI[1] — which maps directly to the Stage 1 stall: pilots that never reach production, demos that never become workflows, initiatives that consume budget without changing how anyone works.
Each stage in the table below carries three columns: what is actually true, the lie the organization tells itself, and what to stop doing, which names the single trap that ends progress for most teams there. That last column is the one that matters. It names the specific failure mode. It does not recommend that you 'invest more in change management.'
| Stage | What's actually true | The lie they tell themselves | What to stop doing |
|---|---|---|---|
| Stage 0 — Curious | ChatGPT subscriptions on personal cards. No sanctioned budget. Zero workflows in production. The Slack channel with AI in the name has 12 members. | "We're exploring strategically." Exploration without a budget line item and a deadline is procrastination. | Stop calling it exploration. Pick one workflow, fund it, give it a 60-day ship deadline. |
| Stage 1 — Experimenting | Sanctioned pilots exist. Demos have been shown. Nothing is in production. The team has a name. Nobody outside the team uses any of it. | "We have active AI initiatives." A pilot is not production software. | Stop running pilots that have no defined path to production. A pilot without a production gate is theater funding. |
| Stage 2 — First Production | 1–3 workflows in real use by real users. ROI is asserted, not measured. Org chart unchanged. The platform team does not exist yet. | "We're scaling AI." One internal tool used by 12 people in ops is not scaling. It is a successful experiment. | Stop treating the first production deployment as a destination. Without an eval pipeline and a platform investment, you stall here for years. |
| Stage 3 — Scaling | 10+ production workflows. A real platform team. First eval pipeline running. Board conversations have moved from "should we" to "how much." | "We're an AI company now." Multiple workflows do not change the default operating model. Most processes were built before AI. | Stop counting use cases as the metric. The metric is what percentage of planning, hiring, and operating decisions are structured around AI capability. |
| Stage 4 — AI Native | AI is the default operating mode. New roles exist that did not exist at Stage 3 — eval engineer, AI PM, platform engineer. Hiring criteria and performance reviews reference AI capability explicitly. | N/A — organizations genuinely here rarely announce it. The tell is that they talk about AI the way they talk about software: infrastructure, not initiative. | Stop mistaking AI-first announcements for AI-native operations. The test is whether removing AI from the org for 30 days breaks core workflows or merely slows some tasks. |
Five Dimensions, Scored Independently
Composite scores hide the asymmetry where the actual risk lives. Score each dimension separately or stop scoring.
Independent dimension scoring matters because asymmetry is where production incidents and regulatory exposure live. A company can be Stage 3 on platform capability and Stage 1 on governance. The platform team has built an agent runtime, deployed eval pipelines, and now manages model costs across three providers. Meanwhile legal has approved zero customer-facing use cases, the risk register does not mention model failure modes, and there is no audit trail for agent decisions. That asymmetry is the production incident waiting to happen. The platform exists. The governance to deploy it safely does not. An averaged score of '2.2' obscures both facts.
Composite scores also create a political problem. The team advancing fastest gets credit for the team lagging worst. The platform engineers who built a working model gateway do not want their score averaged down by the legal team's six-week review queue. Independent scoring forces accountability at the right level: the dimension owner owns the number. Nobody hides behind the composite.
Independent scoring also enables correct investment sequencing. If governance is at Stage 1 and platform is at Stage 3, the next move is obvious: invest in governance until it reaches Stage 2, then reassess. A composite score does not tell you that. It tells you that you average '2.0,' which means almost nothing.
The five dimensions below each carry a 0–4 score, for a total composite of 0–20. Read the profile. The number is the byproduct.
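A minimal Python sketch of reading a profile this way follows. It is an illustration under stated assumptions, not part of the framework: the function name `read_profile`, the dictionary keys, and the sample scores are hypothetical. It takes five independent 0–4 scores and returns the lagging dimension and the spread, with the composite as a byproduct.

```python
from typing import Dict

DIMENSIONS = ("data", "platform", "talent", "governance", "culture")


def read_profile(scores: Dict[str, int]) -> dict:
    """Summarize a five-dimension profile; each score is a stage from 0 to 4."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= 4:
            raise ValueError(f"{name} must be between 0 and 4")

    # Ties resolve to the first dimension listed in DIMENSIONS.
    lagging = min(DIMENSIONS, key=lambda d: scores[d])
    leading = max(DIMENSIONS, key=lambda d: scores[d])
    return {
        "composite": sum(scores[d] for d in DIMENSIONS),
        "lagging": lagging,
        "leading": leading,
        "spread": scores[leading] - scores[lagging],
    }


# Hypothetical scores: strong platform, weak governance.
profile = read_profile(
    {"data": 2, "platform": 3, "talent": 2, "governance": 1, "culture": 2}
)
print(profile)
```

With the sample scores, the lagging dimension is governance and the spread is two stages, which is exactly the signal that a composite of 10 hides.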
The Scoring Rubric: Score on Artifacts, Not on Interviews
If a dimension has no documented artifact, it scores Stage 2 at most — regardless of what leadership believes.
The 30-minute version: pull three specific artifacts per dimension and check them against the Stage 4 criteria below. Artifacts are concrete things you can open and inspect: documents, dashboards, job descriptions, incident logs. If the artifact does not exist, the dimension scores at most Stage 2, regardless of what people tell you in interviews. That is the most important rule in the framework.
For each dimension, the five check items below define the full Stage 4 criteria. Score 0–4 by how many you can verify with a real artifact. Self-reported scores without artifact verification inflate by at least one full stage, every time. Ask a CTO how mature their AI governance is and they will say Stage 3. Ask them to produce the AI risk register right now and the room goes quiet.
The artifact requirement surfaces a second diagnostic: which dimensions have rich documentation and which have almost none. Governance and culture are where organizations most often have internal alignment without written artifacts. Everyone agrees on the policy in principle. Nothing is written, enforced, or tested. That is not Stage 3 governance. That is Stage 1 governance with good intentions. The check items do not care about intentions. They care about what you can produce in under five minutes.
One practical move: assign the scoring to someone two levels below the executive sponsor. They will find the gaps. The executive sponsor will rationalize them. The asymmetry between what each level sees is itself a data point about the culture dimension.
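The artifact rule can be expressed as a small scoring helper. The sketch below is one possible reading, not the framework's official method: the names `CheckItem`, `verified_score`, and `inflation` are hypothetical, the artifact path is made up, and capping five check items at a score of 4 is an assumption, since the text gives five items per dimension and a 0–4 scale without spelling out the mapping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckItem:
    criterion: str
    artifact: Optional[str]  # link or path to the evidence; None if nothing exists


def verified_score(items: List[CheckItem]) -> int:
    """Count check items backed by an artifact, capped at 4 (assumed mapping)."""
    verified = sum(1 for item in items if item.artifact)
    return min(verified, 4)


def inflation(self_reported: int, items: List[CheckItem]) -> int:
    """Stages by which the self-reported score exceeds the verified one."""
    return max(0, self_reported - verified_score(items))


# Hypothetical governance review: only the risk register could be produced.
governance = [
    CheckItem("AI risk register with AI-specific failure modes", "wiki/risk-register"),
    CheckItem("Policy as code", None),
    CheckItem("Incident playbook tested in the last 12 months", None),
    CheckItem("Audit trails for agent decisions", None),
    CheckItem("Clear escalation path", None),
]

print(verified_score(governance))  # 1
print(inflation(3, governance))    # 2
```

In this example a leader who self-reports Stage 3 governance is two stages above what the artifacts support.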
Data Foundation — Stage 4 Criteria
- A documented, enforced single source of truth for each data domain used in AI workflows
- Retrieval mechanisms with freshness guarantees: not just a data lake, but a system with SLAs
- Data permissions propagated automatically into AI systems, not managed manually per use case
- Documented data quality checks that run before data enters any AI pipeline
- A feedback loop from AI output quality back to data engineering priorities
Platform Capability — Stage 4 Criteria
- Model access abstracted through an internal gateway with cost tracking per team or use case
- An eval pipeline that runs on every model update and PR, not only at launch
- Observability into agent decisions: inputs, outputs, tool calls, latency, failure modes
- An agent runtime that supports multi-step workflows, not just single-turn completions
- Cost circuit breakers with alerts and automatic kill switches: no unbounded spend in production
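As an illustration of the last criterion, here is a minimal in-memory sketch of a cost circuit breaker. It assumes a gateway that can report the dollar cost of each call. The class name, thresholds, and alert list are hypothetical stand-ins for whatever budget service and paging system an organization actually runs.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge is attempted after the breaker has tripped."""


class CostCircuitBreaker:
    def __init__(self, team: str, alert_at_usd: float, kill_at_usd: float):
        if alert_at_usd > kill_at_usd:
            raise ValueError("alert threshold must not exceed kill threshold")
        self.team = team
        self.alert_at_usd = alert_at_usd
        self.kill_at_usd = kill_at_usd
        self.spent_usd = 0.0
        self.alerts = []      # messages a real system would send to on-call
        self.tripped = False  # once True, all further spend is refused

    def charge(self, cost_usd: float) -> None:
        """Register a call's cost; refuse once the breaker has tripped."""
        if self.tripped:
            raise BudgetExceeded(f"{self.team}: spend is blocked")
        self.spent_usd += cost_usd
        if self.spent_usd >= self.kill_at_usd:
            self.tripped = True
            self.alerts.append(f"KILL {self.team}: ${self.spent_usd:.2f}")
        elif self.spent_usd >= self.alert_at_usd:
            self.alerts.append(f"ALERT {self.team}: ${self.spent_usd:.2f}")


breaker = CostCircuitBreaker("ops", alert_at_usd=80.0, kill_at_usd=100.0)
breaker.charge(50.0)  # under both thresholds
breaker.charge(35.0)  # crosses the alert threshold
breaker.charge(20.0)  # crosses the kill threshold and trips the breaker
print(breaker.tripped, breaker.alerts)
```

After the third charge the breaker is tripped, and any further `charge` call raises `BudgetExceeded` until someone resets the budget deliberately. The artifact to look for in an assessment is the configuration and alert history of whatever plays this role in production.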
Talent & Roles — Stage 4 Criteria
- A platform engineer role that owns the AI toolchain: filled, active, and cross-team
- An eval engineer or equivalent function owning quality measurement infrastructure
- An AI PM role, or PMs with explicit AI product scope, who can write evals, not just PRDs
- Job descriptions for senior roles explicitly reference AI capability as a requirement
- A career path for AI-native roles that is not a detour back to traditional engineering tracks
Governance — Stage 4 Criteria
- A documented risk register that includes AI-specific failure modes: hallucination, drift, bias
- Policy as code: AI usage rules enforced programmatically, not written in a PDF
- An incident response playbook for AI failures, tested in the last 12 months
- Audit trails for agent decisions that are complete, accessible, and retained per policy
- A clear escalation path: when an AI decision is questioned, who is accountable and how fast does the answer arrive?
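"Policy as code" can be as simple as rules expressed as functions that every AI request must pass before it reaches a model. The sketch below is hypothetical throughout: the `AIRequest` fields, the allow-list, and both rules are invented for illustration, and a real deployment would more likely use a dedicated policy engine than hand-written functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class AIRequest:
    team: str
    use_case: str
    customer_facing: bool
    contains_pii: bool


# Each rule returns a reason string when it blocks the request, else None.
Rule = Callable[[AIRequest], Optional[str]]

APPROVED_CUSTOMER_FACING = {"support-triage"}  # hypothetical allow-list


def no_unapproved_customer_facing(req: AIRequest) -> Optional[str]:
    if req.customer_facing and req.use_case not in APPROVED_CUSTOMER_FACING:
        return "customer-facing use case is not on the approved list"
    return None


def no_pii_to_models(req: AIRequest) -> Optional[str]:
    if req.contains_pii:
        return "request contains PII"
    return None


RULES: List[Rule] = [no_unapproved_customer_facing, no_pii_to_models]


def evaluate(req: AIRequest) -> List[str]:
    """Return every violated rule's reason; an empty list means allowed."""
    return [reason for rule in RULES if (reason := rule(req)) is not None]


blocked = evaluate(AIRequest("sales", "email-drafts", True, False))
allowed = evaluate(AIRequest("support", "support-triage", True, False))
print(blocked, allowed)
```

The returned reasons double as an audit record: logging them alongside the request gives a trail of what was blocked and why, which a PDF policy cannot provide.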
Culture & Operating Model — Stage 4 Criteria
- AI capability is an explicit criterion in performance reviews for any role that could use it
- Planning rituals (sprint planning, quarterly OKRs, annual roadmaps) use AI output as input
- Hiring criteria for all roles above IC-3 require demonstrated AI proficiency, not enthusiasm
- At least one core executive-level metric is derived directly from an AI system
- AI failures are treated as engineering incidents with retrospectives, not as reasons to reduce scope
What Your Composite Score Actually Tells You
The number gives you a range. The dimension profile tells you where to act.
| Composite score | Typical profile | Operational reading |
|---|---|---|
| 0–4 | Scattered across Stage 0 and Stage 1 across most dimensions | You do not have an AI program. You have individual experimentation. Fund one real pilot with a ship deadline. |
| 5–9 | A few dimensions at Stage 2, most still at Stage 1 | You shipped something. You did not build the platform layer. Platform is the constraint on everything else: invest there next. |
| 10–14 | Uneven profile: one or two dimensions advancing, others stalled | Real capability in some areas, meaningful gaps in others. The lagging dimension (almost always governance or culture) is creating risk for the advancing ones. |
| 15–19 | High performer in most dimensions with one lagging constraint | One dimension is actively constraining progress. Identify it, name the blocker (usually a person or a policy), make it the explicit OKR for the next quarter. |
| 20 | All five dimensions at Stage 4, verified by artifacts, not self-report | Genuinely uncommon. If the self-assessment lands here, you scored generously or you sit in a very small set of organizations. Either way the score is less interesting than the marginal weaknesses inside each dimension. |
One note on the 20-point ceiling: the composite is a directional signal, not a precise measurement. Organizations genuinely at different stages on different dimensions score more accurately than organizations uniformly at one stage — the model is calibrated to surface asymmetry. If your five dimension scores are all the same number, you have not looked carefully enough. Real organizations have texture.
The comparison above pairs score ranges with operational readings rather than aspirational labels. Notice that the 0–4 range does not say 'you are behind.' It says you do not yet have an AI program, which is a different and more actionable framing. Knowing you are at Stage 0 is not failure. It is a starting coordinate. The failure is believing you are at Stage 3 because you have a Slack channel and a vendor demo on the calendar. The composite exists to start the right conversation with the right people, not to end one.
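For teams that keep scores in a spreadsheet or script, the bands above translate directly into a lookup. This is a sketch under assumptions: the function name `interpret` is hypothetical, the readings are abbreviated from the text above, and the `uniform` flag implements the note that five identical scores warrant a second look.

```python
from typing import List, Tuple

# Score bands and abbreviated readings, taken from the ranges above.
BANDS: List[Tuple[range, str]] = [
    (range(0, 5), "No AI program yet: fund one real pilot with a ship deadline."),
    (range(5, 10), "Something shipped, no platform layer: invest in platform next."),
    (range(10, 15), "Uneven profile: the lagging dimension is creating risk."),
    (range(15, 20), "One dimension is the constraint: make it next quarter's OKR."),
    (range(20, 21), "Uncommon: recheck the scoring, then study marginal weaknesses."),
]


def interpret(scores: List[int]) -> Tuple[int, str, bool]:
    """Return (composite, reading, uniform) for five dimension scores of 0 to 4."""
    if len(scores) != 5 or any(not 0 <= s <= 4 for s in scores):
        raise ValueError("expected five scores, each between 0 and 4")
    composite = sum(scores)
    reading = next(text for band, text in BANDS if composite in band)
    uniform = len(set(scores)) == 1  # identical scores suggest a shallow review
    return composite, reading, uniform


uneven = interpret([2, 3, 2, 1, 2])
flat = interpret([2, 2, 2, 2, 2])
print(uneven)
print(flat)
```

Both sample profiles sum to 10 and land in the same band, but only the second sets the `uniform` flag. The composite alone cannot tell them apart, which is why the profile is the thing to read.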
Three Anti-Stages That Look Like Progress and Block It
Patterns that pass for momentum from inside the org and prevent advancement at every level.
Not all stagnation looks the same. Some organizations have been at Stage 1 for three years and know it. Others have convinced themselves they are at Stage 3 while exhibiting none of the structural characteristics that define Stage 3. The three anti-stages below are the most common failure modes. Each one looks like progress from outside and feels like progress internally — which is exactly what makes them traps.
The distinction between an anti-stage and genuine progress is usually visible in one place: the gap between what gets demonstrated and what gets used. Demos and production are two different systems. A healthy maturity trajectory has them converging. An anti-stage has them permanently separated, with the demo getting more polished while the production gap quietly widens.
What we got wrong on the first version of this framework: we assumed the anti-stages were obvious to the teams inside them. They are not. Innovation Lab Limbo is the hardest to diagnose because the lab team genuinely believes they are creating value — and by their own metrics, they often are. The tell is what happens when you ask them to name the last workflow that graduated from the lab to production and is now used daily. That question ends most conversations within two sentences.
Demo Theater
Looks like Stage 2. The workflows shown in demos are not actually used by anyone in production. The demo environment and the production environment are different systems. Leadership has seen the demo multiple times. Actual users have not changed their behavior at all. The diagnostic: ask who logs into the production system daily and what decisions it informs. If the answer is uncertain, you are watching theater.
Innovation Lab Limbo
Stuck at Stage 1 indefinitely because the innovation lab, center of excellence, or AI team operates in deliberate isolation from the engineering teams that would have to deploy their work. The lab ships impressive proofs of concept. Nothing crosses into production because the path from lab to engineering is not defined, funded, or staffed. The lab exists to generate optionality on paper. It accidentally prevents real commitment.
Compliance-First Paralysis
Looks like Stage 0 from the outside. The organization actually has significant AI capability that legal has blocked entirely. Every pilot triggers an eight-week legal review. The review process was designed for enterprise software procurement in 2019 and never got updated. The result: engineers route around the system — 68% of employees now use unauthorized AI tools, up from 41% in 2023[5] — while the official program produces nothing. The actual risk is higher than if the organization ran a sanctioned program with real governance.
What to Do at Each Stage: One Move, Sequenced to the Constraint
One concrete move per stage. Address the actual constraint, not the most visible one, and produce an artifact in 30 days.
The instinct at every stage is to do more — more pilots, more use cases, more infrastructure, more governance documentation. The counterintuitive move is almost always to do less, but to finish it. Each stage has one primary bottleneck. Addressing anything other than that bottleneck first is how organizations spend 18 months on the wrong investment and end up where they started.
The contrarian point: running more pilots in parallel at Stage 1 actively makes it harder to reach Stage 2. Teams running 8 pilots simultaneously ship zero to production 73% of the time. Teams running one pilot at a time hit production within 60 days at three times the rate. Breadth creates the illusion of progress. Serialization creates the reality of it.
The actions below are sequenced to address the actual constraint at each stage, not the most visible one. They are designed to produce a tangible artifact within 30 days — because a maturity framework that produces only meetings and decks is not a framework, it is a delay mechanism.
- [01] Stage 0 — Curious: Fund One Real Pilot
The move out of Stage 0 is not a strategy workshop, a consultant engagement, or a task force. It is finding one workflow currently done manually with a measurable output that could plausibly be AI-assisted. Fund it with a real budget line item. Assign one engineer. Give it 60 days to ship something real users touch. Strategy, governance, culture — all of it builds on the credibility of that first production moment.
- [02] Stage 1 — Experimenting: Define the Production Gate
The Stage 1 trap is accumulating pilots. The way out is writing down, in one page, what it takes for a pilot to graduate to production — an eval benchmark, a deployment process, a user acceptance threshold. Without that gate, pilots multiply because there is no forcing function to ship. Battery Ventures' 2025 survey found organizations with a defined pilot-to-production process deployed AI nearly 4x faster than those without one.[6]
- [03] Stage 2 — First Production: Build the Platform Layer Before the Next Use Case
Stage 2 organizations want to replicate the first successful workflow across other use cases. That is the wrong instinct. Build the platform layer first. The second and third workflows built on top of real infrastructure — model gateway, eval framework, observability — take a fraction of the time and are dramatically more maintainable. Built without infrastructure, they accumulate technical debt that eventually stalls the entire program.
- [04] Stage 3 — Scaling: Fix the Lagging Dimension
By Stage 3 the bottleneck is almost always governance or culture, not platform or data. The organization has enough infrastructure to ship, but governance has not kept pace, so risk-averse stakeholders block expansion. Or the culture still treats AI as an innovation initiative rather than an operational expectation, so adoption is uneven and fragile. Deloitte's 2026 enterprise AI report found organizations with mature AI governance were 2x more likely to expand AI deployment in the following 12 months.[7]
- [05] Stage 4 — AI Native: Defend the Operating Model from Regression
The paradox of Stage 4 is that the biggest risk is structural regression as the organization scales. Traditional hiring patterns, additional management layers, and process formalization all trend toward reintroducing the coordination tax that AI-native operations eliminated. Stage 4 organizations need explicit policies that defend leverage — headcount justification requirements, management ratio constraints, and regular audits of whether new processes were designed for human limitations or AI-native operations.
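The one-page production gate described under Stage 1 can also be written down as data, which makes the graduation decision mechanical. The following is a hedged sketch: the three criteria come from the text (an eval benchmark, a deployment process, a user acceptance threshold), while the class names, field names, pilot name, and threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductionGate:
    min_eval_score: float        # pass rate required on the eval benchmark
    min_user_acceptance: float   # share of pilot users who accept the output
    require_deploy_runbook: bool = True


@dataclass
class PilotStatus:
    name: str
    eval_score: float
    user_acceptance: float
    has_deploy_runbook: bool


def gate_failures(pilot: PilotStatus, gate: ProductionGate) -> List[str]:
    """List the gate criteria the pilot has not met; empty means it can ship."""
    failures = []
    if pilot.eval_score < gate.min_eval_score:
        failures.append("eval benchmark below threshold")
    if pilot.user_acceptance < gate.min_user_acceptance:
        failures.append("user acceptance below threshold")
    if gate.require_deploy_runbook and not pilot.has_deploy_runbook:
        failures.append("no documented deployment process")
    return failures


gate = ProductionGate(min_eval_score=0.90, min_user_acceptance=0.70)
pilot = PilotStatus("invoice-triage", eval_score=0.93, user_acceptance=0.64,
                    has_deploy_runbook=True)
print(gate_failures(pilot, gate))
```

Here the pilot clears the eval benchmark and has a deployment process but misses the acceptance threshold, so the remaining work is specific rather than open-ended. The forcing function is the gate's existence. The exact numbers matter less.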
Common Objections from CTOs and CIOs
The questions that come up in every honest maturity conversation, with the operational answer, not the diplomatic one.
We're in a regulated industry — does this framework still apply?
Yes, but the governance dimension scores against a higher bar. In financial services, healthcare, and insurance, a Stage 4 governance score requires not just internal controls but documented evidence of regulatory compliance mapped to each AI system. The scoring is harder to reach, which is correct: the cost of getting governance wrong is higher. The compensating advantage is that organizations in regulated industries that build mature governance often have a defensible moat unregulated competitors cannot replicate quickly. The framework applies. The artifacts that prove each criterion look different.
What if we're genuinely advanced in one dimension and well behind in another?
That is the most common real-world profile, and it is exactly what this framework is designed to surface. The action is straightforward: identify the lagging dimension, name the specific bottleneck inside it (a person, a policy, a process, a budget decision), and treat that bottleneck as the top-priority OKR for the next quarter. The advanced dimension waits — it does not benefit from further investment until the lagging dimension catches up. Pouring more resources into platform capability when governance is the constraint just builds liability faster.
How do you score a company that uses AI heavily but bans it for customer-facing products?
Score the internal dimension accurately. Score the external posture honestly as a business decision, not as a maturity gap. A company can be Stage 3–4 on internal AI operations and have a deliberate policy against customer-facing AI. That is product and regulatory judgment, not a maturity failure. The thing to watch for: whether the ban is a documented policy decision with explicit rationale, or whether it is risk avoidance covering a governance deficit. If the ban exists because the organization could not answer basic questions about audit trails and incident response, that is a governance score of 1, not a strategic choice.
What's the fastest path from Stage 1 to Stage 2?
Ship one workflow to production in 30 days. The constraint is rarely technical. It is the absence of a defined path and a decision-maker willing to accept the first version as good enough. Identify the most permissive stakeholder with a real problem, build the minimum viable AI workflow, deploy it internally, and call it production. The bar: real users, real decisions, real data. Once something is in production — even at small scale — the conversation about governance, platform, and the next use case becomes concrete instead of hypothetical. Concreteness is what accelerates everything else.
Should the maturity assessment be done by an outside consultant?
Only if the internal team will not be honest. The artifact-based approach is deliberately designed to remove the interpretation layer that consultants add — either by being polite (inflating the score) or by being strategic (deflating the score to create engagement scope). Pull the artifacts, check them against the criteria, and the score is the score. An outside perspective is useful for the cultural dimension — where self-assessment bias is highest — and for surfacing blind spots the internal team has normalized. You do not need a consulting engagement to run a 30-minute artifact review.
The 30-Minute Self-Assessment
- Data: Pull the data dictionary or source-of-truth documentation. Does it exist, and is it current?
- Data: Verify at least one AI workflow has documented data freshness guarantees
- Platform: Confirm a model gateway with cost tracking per team is running in production
- Platform: Verify an eval pipeline runs on every model update, not only at launch
- Talent: Find the job description for your platform engineer or equivalent AI infrastructure role
- Talent: Confirm at least one senior engineering JD explicitly requires AI proficiency
- Governance: Locate the AI risk register. Does it include model-specific failure modes?
- Governance: Verify an AI incident response playbook exists and was tested in the last 12 months
- Culture: Check whether AI is named in performance review criteria for any role
- Culture: Confirm at least one executive-level metric is derived directly from an AI system output
The point of this assessment is not to reach Stage 4. Most organizations running serious AI programs will spend years at Stage 2 and Stage 3, and that is not a failure — it is the expected distribution. The point is to know where you actually are, scored against artifacts rather than intentions, so the next investment lands on the actual constraint and not the most exciting opportunity.
The companies that get into trouble are the ones that believe their own Stage 3 narrative while their governance dimension sits at Stage 1. They ship fast, accumulate risk invisibly, and then a production incident or a regulatory inquiry forces a full reset. The maturity profile does not prevent that incident on its own. It tells you the risk is forming before it compounds — which is more than vendor models were ever designed to do.
Run the 30-minute self-assessment above before the next AI strategy document gets written. If the artifact review produces a lower score than expected, that is the diagnostic working. The next move is structural, not strategic. Structural moves return capital. Strategy documents that describe a future you have not built the foundation for do not.
- [1] MIT report: 95% of generative AI pilots at companies are failing. Fortune (fortune.com)
- [2] The state of AI in 2025: Agents, innovation, and transformation. McKinsey (mckinsey.com)
- [3] State of AI trust in 2026: Shifting to the agentic era. McKinsey (mckinsey.com)
- [4] 11 Stats About Shadow AI in 2026. JumpCloud (jumpcloud.com)
- [5] Menlo Security 2025 Report: 68% Surge in Shadow Generative AI Usage (menlosecurity.com)
- [6] Survey: Enterprises shift from AI pilots to production. Battery Ventures (battery.com)
- [7] The State of AI in the Enterprise: 2026 AI Report. Deloitte US (deloitte.com)
- [8] Gartner AI Maturity Model (gartner.com)