Is enterprise AI delivering ROI in 2026?

Locally yes, enterprise-wide rarely. McKinsey's State of AI 2025 (November 2025, n=1,993) found that only 39% of respondents attribute any EBIT impact to AI and only ~6% qualify as AI high performers, attributing more than 5% of EBIT and significant value to AI. MIT NANDA's GenAI Divide (July 2025) estimates that 95% of generative-AI pilots are failing to deliver measurable returns despite an estimated $30-40 billion of enterprise spend. Business-unit wins are real (Morgan Stanley adviser adoption, GitHub/Accenture developer telemetry, focused customer-service deployments), but firm-wide P&L impact still requires workflow redesign, measurement and governance that most organizations have not yet built.

What does an enterprise LLM workload actually cost?

Less than executives assume on the meter, more than they assume on total cost of ownership. For an internal knowledge assistant doing 55,000 Q&A turns per month (~110M input + 33M output tokens), public list prices captured in May 2026 range from roughly $231 on GPT-5.4 mini to $1,650 on Claude 3.5 Sonnet via Bedrock. A 300-seat Microsoft 365 Copilot Business deployment at the annual list price already runs about $5,400 per month, before implementation. The expensive parts usually sit outside the base token price: data connectors, permissions, evaluation harnesses, security reviews, observability, change management and human review.

Why do most generative-AI pilots stall?

Gartner forecast in July 2024 that at least 30% of generative-AI projects would be abandoned after proof of concept by end-2025; MIT NANDA puts the share returning zero P&L impact at around 95%. The reasons are mostly operational rather than technical: poor data quality, inadequate risk controls, escalating costs, unclear business value, missing baselines, and no plan for redeploying time saved. McKinsey's 2025 survey points the same way: workflow redesign is the strongest leadership predictor of EBIT impact, and AI high performers are nearly 3x more likely than peers to have fundamentally redesigned a workflow. Pilot success says almost nothing about production economics.

What is the right operating model for enterprise AI?

A central control plane (data governance, model access, observability, policy, evaluation) combined with partly decentralised business-unit experimentation. Use a portfolio approach to use-case selection: start with high-frequency tasks where the workflow is measurable and the error surface is controlled. Run AI FinOps from day one: output caps, prompt caching, batch processing, routing simple requests to smaller models, separate budgets for model spend, search and human review. Tier human oversight by risk. Klarna's May 2025 reversal is the cautionary case study on going too far on automation alone.

Should we build, buy, or partner for enterprise AI?

All three, sequenced by use-case maturity and switching cost. Buy seat-based copilots for productivity work where vendor-managed grounding is acceptable. Build only where you need proprietary workflow logic, data residency, or model-routing economics that off-the-shelf platforms cannot give you. Partner for edge capabilities (vertical models, evaluation, governance tooling) where the market is moving faster than internal cycles. Be aware that vendor lock-in has moved from prompt portability to operational portability: Azure provisioned throughput units (PTUs), Google generative AI scale units (GSUs), cache semantics and routing tiers compound over time. MIT NANDA's 2025 data suggests purchased copilots currently outperform custom internal builds on adoption and ROI.

How do I measure AI impact credibly to my board?

Measure the business process, not the activity. Resolution time, merge rate, cycle time, conversion, gross margin, claims accuracy and working-capital velocity are the right units of account. Use holdouts, pre/post baselines or randomised roll-outs where possible. The GitHub/Accenture study is the strongest public template: it combines telemetry, user surveys and an explicit experimental design. Treat vendor-sponsored ROI multiples as ambition-setting, not as audited returns; treat MIT NANDA's 95%-fail and McKinsey's 6%-high-performer figures as the realistic outer bounds of where most programmes currently sit.

Enterprise Impact of the AI Fever: What Almost Four Years of GenAI Have Actually Bought

By Juan Beltrán · 2026-05-11

Almost four years since ChatGPT, the enterprise picture is real but uneven. McKinsey's State of AI 2025: 88% of organizations now use AI regularly, but only one-third have begun to scale and only ~6% qualify as AI high performers attributing more than 5% of EBIT to AI. MIT NANDA's GenAI Divide estimates 95% of generative-AI pilots return zero P&L impact. IBM 2026: 76% have a Chief AI Officer, but only 25% of the workforce uses AI regularly. The firms compounding value treat AI as an operating-model programme: they redesign workflows, route cheap models to cheap work, measure baseline and uplift, and govern AI as a production system.

The story is the gap

Almost four years after ChatGPT shipped on 30 November 2022, the most important enterprise number is not a benchmark score or a model price. It is the gap between what boards have authorised and what front-line workflows have actually absorbed.

McKinsey's State of AI 2025 (published November 2025, n=1,993, fielded June-July 2025) is the cleanest single read on that gap. 88% of organizations now report regular use of AI in at least one business function, up from 78% a year earlier. But only about one-third have begun to scale AI across the enterprise. Only 39% of respondents attribute any EBIT impact to AI use, and most of those say it is less than 5%. Only ~6% qualify as McKinsey "AI high performers" attributing more than 5% of EBIT and significant value to AI. The single strongest leadership predictor of high-performer status is fundamental workflow redesign: high performers are nearly 3x more likely than peers to have redesigned a workflow around AI.

The MIT NANDA report, The GenAI Divide: State of AI in Business 2025, sharpened that picture in July 2025. Despite an estimated $30-40 billion of enterprise spend on generative AI, MIT estimates that roughly 95% of corporate generative-AI pilots are still failing to deliver measurable P&L returns. The gap is concentrated in custom internal builds rather than purchased copilots. IBM's 2026 CEO Study, looking at the leadership layer, found 76% of organizations have appointed a Chief AI Officer and 64% of CEOs are comfortable making major strategic decisions on AI-generated input, yet only 25% of the workforce uses AI regularly in daily work and 83% of CEOs say success now depends more on people's adoption than on the technology itself.

That is the story. AI is now nearly universal in enterprise use, but only shallowly integrated. The firms compounding value are not the ones with the loudest narrative. They are the ones converting model capability into measured workflow change, controlling cost through architecture, and governing AI as a production system rather than a demo.

My operator read: when an AI programme cannot name the workflow owner, baseline, failure mode, and cost guardrail, it is still a demo even if the interface is in production. The organizations that create value make those four names boringly explicit before they ask for another pilot.
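One way to make those four names explicit is a small record that refuses to call an initiative production-ready while any of them is blank. This is a minimal sketch: the class name, field names and example values are my own illustrative assumptions, not part of any cited framework.

```python
from dataclasses import dataclass, fields


@dataclass
class AIInitiative:
    """Hypothetical record of the four names a programme should state."""
    workflow_owner: str = ""
    baseline: str = ""
    failure_mode: str = ""
    cost_guardrail: str = ""

    def missing(self) -> list[str]:
        # A field counts as missing when it is empty or whitespace only.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def is_still_a_demo(self) -> bool:
        # Any unnamed item keeps the initiative in demo status.
        return bool(self.missing())


pilot = AIInitiative(workflow_owner="Head of Claims",
                     baseline="median cycle time, 2025")
print(pilot.missing())          # ['failure_mode', 'cost_guardrail']
print(pilot.is_still_a_demo())  # True
```

The check is deliberately crude: it tests that the names exist, not that they are good.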

This piece is my synthesis of the evidence available up to 12 May 2026: McKinsey's State of AI 2025, MIT NANDA's GenAI Divide, IBM's 2026 CEO Study, the Stanford HAI AI Index 2025, Gartner forecasts, vendor pricing pages, public-company filings, official case studies, and standards guidance. I write it the way I read the market: numbers first, story second, opinion last.

What enterprises are actually building

The deployment mix is far narrower than the rhetoric of "AI transformation" suggests. The dominant patterns are copilots for office productivity, enterprise search on internal content, customer-service assistants, code assistants, document and meeting summarisation, and bounded agents that take actions only inside a workflow shell. Text generation is still the dominant modality. Code and images sit materially behind it.

A practical taxonomy:

Project type	Typical architecture	Why it wins early	Main cost or risk centre
SaaS productivity copilot	Vendor-managed model + tenant data grounding + seat licence	Fastest procurement and change path	Seat spend, identity, data boundaries
Enterprise search / knowledge chat	RAG over internal documents, vector or hybrid search, citations, RBAC	High value, low action risk	Retrieval quality, permissions, evaluation
Customer-service bot	Conversation orchestrator + knowledge base + CRM/workflow actions + escalation	Clear ROI and high volume	Hallucinations, deflection quality, customer trust
Code assistant	IDE extension or repo-grounded assistant	Immediate developer adoption and measurable telemetry	Security, code quality, IP, compliance
Agentic back-office workflow	LLM planner/router + tool use + approvals + API execution	Highest automation upside	Reliability, exception handling, governance

The architecture pattern that now dominates production "copilots" is not "one giant model". It is a stack: identity and access control, orchestration, retrieval-augmented generation, tool or API calling, observability, evaluation and human escalation. Microsoft, AWS, Google Cloud and OpenAI all publish near-identical reference architectures. OpenAI's own enterprise guide explicitly recommends starting with a single agent and adding tools incrementally rather than jumping directly to multi-agent systems.

Three technical decisions separate pilots from production. First, RAG for volatile or permission-sensitive knowledge, because it grounds answers on current company data without retraining the model. Second, tool use only where there is a formal workflow boundary, such as CRM updates, ticketing or meeting-note insertion. Third, fine-tuning or distillation reserved for high-volume, stable tasks where prompt-only control is no longer good enough.
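The first decision, RAG over permission-sensitive knowledge, can be sketched in a few lines. The example below is a toy under stated assumptions: word overlap stands in for vector or hybrid search, and the Doc class, group names and documents are hypothetical. The one point it illustrates is ordering: the permission filter runs before ranking, so restricted text never reaches the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset


def retrieve(query: str, docs: list, user_groups: set, k: int = 2) -> list:
    """Return up to k doc_ids the user may read, ranked by word overlap."""
    terms = set(query.lower().split())
    scored = []
    for d in docs:
        # Permission check comes BEFORE scoring: no shared group, no access.
        if not (d.allowed_groups & user_groups):
            continue
        score = len(terms & set(d.text.lower().split()))
        if score > 0:
            scored.append((score, d.doc_id))
    # Highest score first; doc_id breaks ties deterministically.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:k]]


corpus = [
    Doc("hr-001", "parental leave policy for employees", frozenset({"all-staff"})),
    Doc("fin-007", "executive bonus policy and leave accrual", frozenset({"finance"})),
]
print(retrieve("leave policy", corpus, {"all-staff"}))  # ['hr-001']
```

A production system would replicate permissions from the source system rather than hard-code groups, which is where much of the connector work sits.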

The Morgan Stanley pattern is the clearest public example. The firm's early deployment focused on internal knowledge assistance and later on meeting debriefing, with an evaluation framework used to validate output quality before broad production use. By April 2025, public reporting from Morgan Stanley and OpenAI showed 98% adviser adoption and a jump in document access from 20% to 80%, with natural-language querying over more than 100,000 proprietary research documents. That is not AI replacing the adviser. That is AI acting as a search, summarisation and drafting layer on top of existing human judgement.

Where to apply this on your own portfolio

To pressure-test a specific initiative, run it through the same six layers I use when I write about this: Baseline, Architecture, Surface, Intelligence, Containment and Scale.

The economics of tokens, APIs and cloud spend

The pricing market has matured into several distinct models: direct token metering, seat licences, priority or reserved throughput, and platform-layer charges for routing, search, tools or governance. The core point is that a quoted "model price" is rarely the full delivered price.

Vendor pricing comparison

Vendor	Pricing model	Illustrative current public price	Important caveats
OpenAI	Direct API token pricing; Batch -50%; data residency +10%	GPT-5.5 $5 input / $30 output per MTok; GPT-5.4 $2.50 / $15; GPT-5.4 mini $0.75 / $4.50	Current flagship pricing centres on GPT-5.x; cached input is materially cheaper than base input.
Anthropic	Direct API token pricing; prompt caching; Batch -50%; priority and regional premiums on third-party platforms	Claude Opus 4.7 $5 / $25; Claude Sonnet 4.6 $3 / $15; Claude Haiku 4.5 $1 / $5	Cache hits cost 10% of base input rate; Opus 4.7 may consume up to 35% more tokens for the same text; tool use adds extra tokens.
Google Cloud	Vertex/Agent Platform token pricing with Priority versus Flex/Batch and provisioned throughput (GSUs)	Gemini 3.1 Pro Preview Flex/Batch: $1 / $6 per MTok up to 200k input tokens, then $2 / $9 above; Priority: $3.6 / $21.6 up to 200k, then $7.2 / $32.4	Pricing is tiered by context size and service tier; provisioned throughput is bought in GSUs, not per-call tokens.
Microsoft	Azure OpenAI token pricing plus PTUs; separate seat-based Copilot products	Azure GPT-4.1 Global Standard $2 / $8; Priority $3.50 / $14; Batch $1 / $4 per MTok. Microsoft 365 Copilot Business at $18 user/month annual or $25.20 monthly commitment for up to 300 users	Mixes two economic models: token-metered APIs and bundled seat licences. Copilot Chat is included for eligible Microsoft 365 users; custom agents can be metered.
AWS	Bedrock marketplace pricing across Standard/Flex/Priority/Reserved tiers; model-specific rates; batch discounts	Claude 3.5 Sonnet on Bedrock extended access $6 / $30 per MTok, batch $3 / $15; Intelligent Prompt Routing $1 per 1,000 requests	Bedrock is a platform layer, not a single model family. Prices vary by provider, region and tier; reserved throughput is usually sold through account teams.

Two strategic observations follow. First, vendor list prices are converging downward for text-only work, especially on smaller models, but remain structurally asymmetric: output tokens are priced at several times the input rate. Second, the most important cost-control levers are now architectural rather than contractual: caching, batching, routing simpler prompts to smaller models, and limiting unnecessary output length.
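A rough sketch of how two of those levers move a bill, using the Sonnet 4.6 list price and the 10% cache-read rate quoted in the pricing table. The 60% cache-hit share is a hypothetical input, the function ignores any surcharge for writing to the cache, and whether cache and batch discounts can be combined on a given platform is an assumption to check, so the example applies them separately.

```python
def monthly_cost(input_mtok, output_mtok, in_price, out_price,
                 cache_hit_share=0.0, cache_read_factor=0.10,
                 batch_discount=0.0):
    """List-price cost in USD; token volumes are in millions of tokens.

    cache_hit_share:   fraction of input tokens served from cache (0..1)
    cache_read_factor: cached input price as a fraction of base input price
    batch_discount:    fraction taken off the bill for batch jobs
    """
    cached = input_mtok * cache_hit_share
    fresh = input_mtok - cached
    cost = (fresh * in_price
            + cached * in_price * cache_read_factor
            + output_mtok * out_price)
    return round(cost * (1.0 - batch_discount), 2)


# Internal knowledge assistant: 110M input + 33M output tokens per month.
base = monthly_cost(110, 33, 3.00, 15.00)
with_cache = monthly_cost(110, 33, 3.00, 15.00, cache_hit_share=0.6)
with_batch = monthly_cost(110, 33, 3.00, 15.00, batch_discount=0.5)
print(base, with_cache, with_batch)  # 825.0 646.8 412.5
```

Caching only touches the input side, which is why output length remains the lever that matters most for chatty workloads.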

Three workload shapes, priced at list

The next table is intentionally operational. It shows what three common workload shapes look like priced only at current public text-token rates, excluding storage, search, seat licences, security tooling, observability, engineering labour and taxes.

Scenario	Monthly volume assumption	OpenAI GPT-5.4 mini	Anthropic Sonnet 4.6	Google Gemini 3.1 Pro Flex/Batch	Azure GPT-4.1 Global	AWS Bedrock Claude 3.5 Sonnet
Internal knowledge assistant	55,000 Q&A turns; 110M input + 33M output tokens	$231	$825	$308	$484	$1,650
Meeting-summary copilot	20,000 meetings; 160M input + 20M output tokens	$210	$780	$280	$480	$1,560
Large-scale customer-service bot	10M chats; 10,000M input + 2,500M output tokens	$18,750	$67,500	$25,000	$40,000	$135,000

These figures are illustrative, not contractual. They are useful for one insight. For many internal text workloads, model metering alone is not the dominant cost line. A 300-seat Microsoft 365 Copilot Business deployment at the annual-plan list price would cost roughly $5,400 per month before implementation work, already higher than the model meter for many modest internal API workloads. Once an organisation runs very high-volume support or consumer-facing workloads, or pushes into video, image, long-context or tool-heavy pipelines, inference becomes materially visible.
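The arithmetic behind the table is simple enough to reproduce and adapt. The sketch below uses only the list prices and volumes quoted above; the Gemini row uses the rate for requests up to 200k input tokens, and the function and dictionary names are my own.

```python
# USD per million tokens (input, output), as quoted in the pricing table.
PRICES = {
    "OpenAI GPT-5.4 mini": (0.75, 4.50),
    "Anthropic Sonnet 4.6": (3.00, 15.00),
    "Google Gemini 3.1 Pro Flex/Batch": (1.00, 6.00),
    "Azure GPT-4.1 Global": (2.00, 8.00),
    "AWS Bedrock Claude 3.5 Sonnet": (6.00, 30.00),
}

# Monthly volume in millions of tokens (input, output).
SCENARIOS = {
    "Internal knowledge assistant": (110, 33),
    "Meeting-summary copilot": (160, 20),
    "Large-scale customer-service bot": (10_000, 2_500),
}


def list_cost(scenario: str, model: str) -> float:
    """Token-meter cost only: no storage, search, seats or labour."""
    in_mtok, out_mtok = SCENARIOS[scenario]
    in_price, out_price = PRICES[model]
    return in_mtok * in_price + out_mtok * out_price


def copilot_seat_cost(seats: int, per_seat_month: float = 18.0) -> float:
    """Seat-licence cost at the annual-plan list price quoted above."""
    return seats * per_seat_month


for scenario in SCENARIOS:
    print(scenario, [f"${list_cost(scenario, m):,.0f}" for m in PRICES])
print(f"${copilot_seat_cost(300):,.0f}")  # $5,400
```

Swapping in your own volumes and negotiated rates is usually more informative than any cross-vendor ranking.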

Throughput and quotas as real consumption signals

Because public token-volume disclosures are scarce, the best observable indicators of real enterprise usage are throughput quotas, conversations, documents and user adoption. OpenAI lists GPT-5.5 rate limits from 500,000 TPM at Tier 1 to 40,000,000 TPM at Tier 5. Anthropic measures rate limits in RPM, ITPM and OTPM, and notes that with a 2,000,000 ITPM limit and an 80% cache-hit rate, effective input throughput can reach 10,000,000 tokens per minute because cache reads do not count toward ITPM for most current models.
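The Anthropic figure follows from one division: if cache reads do not count toward the limit, only the uncached share of input consumes quota. A minimal sketch, with a function name of my own choosing:

```python
def effective_input_tpm(itpm_limit: int, cache_hit_rate: float) -> int:
    """Input tokens per minute that fit under the limit when cached
    reads are exempt from it."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    # Only the uncached fraction (1 - hit rate) counts against ITPM.
    return round(itpm_limit / (1.0 - cache_hit_rate))


print(effective_input_tpm(2_000_000, 0.80))  # 10000000
```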

On platform layers, the numbers are larger still. AWS documents 6,000,000 cross-region TPM for Claude Sonnet 4.6 and 15,000,000 for Claude Opus 4.7 on Bedrock. Google Cloud documents 30,000 RPM per model per region as a system limit in Vertex AI, with examples such as 3.4M TPM per project for Gemini 2.0 Flash text workloads and 67,250 tokens per second of continuous throughput from 25 GSUs of provisioned Gemini 2.5 Flash. Azure frames quotas in TPM and RPM per region, subscription and model or deployment type, and positions PTUs as the predictable-capacity option for steady workloads.
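The Google example implies a per-unit rate that is handy for first-pass capacity sizing. The sketch below assumes throughput scales linearly with GSU count and that the quoted rate applies to your model and traffic mix; both are assumptions to verify with the vendor, and the 100,000 tokens-per-second target is hypothetical.

```python
import math

# 67,250 tokens/second from 25 GSUs, as quoted above.
TOKENS_PER_SEC_PER_GSU = 67_250 / 25


def gsus_needed(target_tokens_per_sec: float) -> int:
    """Smallest whole number of GSUs covering the target, assuming
    linear scaling (an assumption, not a documented guarantee)."""
    return math.ceil(target_tokens_per_sec / TOKENS_PER_SEC_PER_GSU)


print(TOKENS_PER_SEC_PER_GSU)  # 2690.0
print(gsus_needed(100_000))    # 38
```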

Cost increases that are partly attributable to AI

The bluntest macro indicator is the IBM finding that executives expected the average cost of compute to rise 89% between 2023 and 2025, with 70% calling generative AI a critical driver. It is a strong directional signal, but still survey evidence rather than audited accounts.

The public-company evidence is more precise but more nuanced: AI costs are usually visible, but mixed with hosting, support and platform growth rather than cleanly isolated.

Organisation	What changed	How much was clearly attributable to AI?	Reading
Duolingo	Q2 2025: gross margin declined 100 bps YoY due to higher AI costs from Duolingo Max expansion. Q3 2025: total gross margin fell to 72.5% from 72.9%; nine-month 72.0% from 73.1%, with management citing higher generative-AI costs and hosting	Partly specified: AI is named, exact split vs hosting unspecified	AI cost pressure showing up in margin, not as a separate line item
Duolingo	Q1 2026: gross margin expanded 190 bps YoY to 73.0%, driven primarily by reductions in per-unit AI costs	Clearly AI-related directionally, exact amount unspecified	Early margin compression can reverse as unit economics improve
Snowflake	FY2026 filing: cost of product revenue includes third-party cloud infrastructure expenses including those related to GPUs and AI inference, with some investments incurred ahead of revenue	Mixed attribution: AI explicit but bundled with support and international expansion	AI load is entering cloud cost of revenue but is not separably quantified
Klarna	Sales and marketing spend fell 11% in Q1 2024; the company said AI accounted for 37% of those savings, about $10M annualised	Directly AI-attributed by company	Rare case where management gives an explicit AI share of cost movement
Gartner benchmark	Typical deployment approaches can cost $5M-$20M; many projects abandoned after POC	Not company-specific	Use as a benchmark range, not a firm-level estimate

The conclusion: AI-attributable cost is often only partly isolatable in public reporting. In production, AI cost arrives as a bundle: model inference, GPU-heavy cloud workloads, enterprise search, data engineering, security tooling, support and adoption programmes. This is precisely why AI economics can look cheap in a lab and expensive in a finance review.
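Because the Duolingo rows mix percentages and basis points, it helps to convert between them explicitly. The sketch below uses only the margins quoted in the table; the implied prior-year figure is derived arithmetic, not a number taken from a filing.

```python
def bps_change(current_pct: float, prior_pct: float) -> int:
    """Change in basis points between two percentages (1 pt = 100 bps)."""
    return round((current_pct - prior_pct) * 100)


def implied_prior_pct(current_pct: float, change_bps: int) -> float:
    """Back out the comparison-period margin from a reported bps move."""
    return round(current_pct - change_bps / 100, 1)


print(bps_change(72.5, 72.9))        # -40   (Q3 2025 vs Q3 2024)
print(bps_change(72.0, 73.1))        # -110  (nine months, YoY)
print(implied_prior_pct(73.0, 190))  # 71.1  (implied Q1 2025 margin)
```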

Business outcomes and ROI

The best evidence does not show a uniform AI productivity dividend. It shows bounded gains in bounded contexts, with the biggest pay-offs where the task is frequent, the workflow is measurable, and the error surface is controlled.

Organisation	Use case	Measured outcome	What the metric proves and does not prove
Morgan Stanley with OpenAI	Adviser knowledge assistant (AI @ Morgan Stanley Assistant) and meeting debrief (AI @ Morgan Stanley Debrief, OpenAI-powered)	98% of advisers use the OpenAI-powered assistant; document access rose from 20% to 80%; Debrief now generates client-meeting notes and follow-up drafts in production	Strong evidence of adoption and productivity support, with a public evaluation framework; no disclosed firm-wide ROI equation
GitHub with Accenture	Enterprise code assistant	8.69% more pull requests, 15% higher merge rate, 84% more successful builds; 90% of developers felt more fulfilled; 95% enjoyed coding more	High-quality enterprise evidence: combines telemetry and controlled study design
Klarna (2024 peak)	Customer-service assistant	2.3M conversations in month one; two-thirds of support chats; work equal to 700 full-time agents; average resolution time dropped from 11 minutes to under 2 minutes; "on par" CSAT	Strong efficiency signal at peak
Klarna (May 2025 reversal, Bloomberg)	Customer-service rebalance	CEO Sebastian Siemiatkowski publicly admits the AI-only approach went too far on cost; firm restarts hiring of human agents and commits to a hybrid model	Direct evidence that automation-only roll-outs hit a customer-experience ceiling; the case for AI as augmentation, not replacement
Klarna	Marketing operations	AI responsible for 37% of cost savings, about $10M annualised	Rare explicit AI share of cost savings
Duolingo	AI-enabled premium tier and content economics	Q1 2026 gross margin up 190 bps YoY, primarily from lower per-unit AI costs	ROI improves materially when unit AI costs fall after deployment and optimisation
Google Cloud customers	Customer support and research infrastructure	LUXGEN: AI agent reduced human customer-service workload by 30%. Citadel Securities: TPUs ran AI workloads up to 4x faster with 30% lower costs	Directional, vendor-selected case studies

At the market level, the most-quoted ROI multiples are still primarily survey- or sponsor-based and should be read as ambition-setting rather than audited returns. The credible counterweight is MIT NANDA's GenAI Divide finding (July 2025) that roughly 95% of generative-AI pilots are still failing to produce measurable P&L impact. Both numbers can be true at the same time: a small minority of firms are extracting outsized returns; the modal firm is not.

The sober ROI rule: AI creates measurable value when time saved is actually redeployed into throughput, sales, service or accuracy, not when the organisation simply observes that people finished a task faster. McKinsey 2025's finding that workflow redesign is the single strongest leadership predictor of high-performer status is the best one-line summary of this point.
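That rule can be written as a one-line formula: only the redeployed share of saved time counts as value. The sketch below is illustrative; the hours, hourly value, redeployment share and programme cost are hypothetical inputs, not figures from any cited study.

```python
def redeployed_roi(hours_saved: float, redeployed_share: float,
                   value_per_hour: float, monthly_cost: float) -> float:
    """ROI where only time actually redeployed into output counts."""
    value = hours_saved * redeployed_share * value_per_hour
    return round((value - monthly_cost) / monthly_cost, 4)


# 1,000 hours saved per month, worth $80/hour, against $20,000 of cost.
print(redeployed_roi(1_000, 0.40, 80.0, 20_000))  # 0.6
print(redeployed_roi(1_000, 0.00, 80.0, 20_000))  # -1.0
```

The second line is the uncomfortable case: identical time savings, nothing redeployed, and the whole spend is lost.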

Failure modes, hidden costs and organisational effects

The most expensive enterprise AI mistakes are usually not model mistakes. They are system-design and operating-model mistakes.

Hidden technical costs

The first is token inflation outside the base prompt. Tool schemas, system prompts, retrieved chunks, conversation history and model-specific tokenisers all change real spend. Anthropic notes that Opus 4.7 may use up to 35% more tokens for the same fixed text because of a new tokenizer; tool-use documentation shows extra system-prompt overheads of roughly 313-346 tokens per call once tools are enabled. A headline "$5 per MTok" can materially understate actual cost in a tool-heavy, poorly pruned implementation.
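To see how far the headline can drift, the sketch below loads a single call with the two overheads named above: up to 35% tokenizer inflation and 346 extra tokens of tool overhead. The 2,000-token prompt and 500-token answer are hypothetical and are counted under the older tokenizer; applying the inflation to output as well as input is my assumption.

```python
def cost_per_call(prompt_tokens: int, output_tokens: int,
                  in_price: float, out_price: float,
                  tokenizer_inflation: float = 0.0,
                  tool_overhead_tokens: int = 0) -> float:
    """USD cost of one call; prices are per million tokens."""
    billed_in = prompt_tokens * (1 + tokenizer_inflation) + tool_overhead_tokens
    billed_out = output_tokens * (1 + tokenizer_inflation)
    return round((billed_in * in_price + billed_out * out_price) / 1_000_000, 6)


# Opus 4.7 list price from the pricing table: $5 input / $25 output.
headline = cost_per_call(2_000, 500, 5.0, 25.0)
loaded = cost_per_call(2_000, 500, 5.0, 25.0,
                       tokenizer_inflation=0.35, tool_overhead_tokens=346)
print(headline, loaded)  # 0.0225 0.032105
```

In this worst-case configuration the loaded call costs roughly 43% more than the headline rate implies, before retrieved chunks or conversation history are added.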

The second is latency as a function of verbosity. Google Cloud states explicitly that latency is directly proportional to generated token length and recommends constraining max_output_tokens to avoid needless delay and spend. Long outputs increase direct inference cost, queueing pressure, user wait time and reviewer burden.
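A simple linear model makes the effect of an output cap concrete. It is a sketch, not a vendor formula: the half-second time to first token and the 50 tokens-per-second decode rate are hypothetical, and real latency also depends on load and model tier.

```python
def estimated_latency_s(output_tokens: int, tokens_per_second: float,
                        time_to_first_token_s: float = 0.5) -> float:
    """Fixed start-up delay plus time proportional to generated length."""
    return time_to_first_token_s + output_tokens / tokens_per_second


def apply_cap(requested_tokens: int, max_output_tokens: int) -> int:
    """Generation stops at the cap, so the cap bounds cost and wait time."""
    return min(requested_tokens, max_output_tokens)


uncapped = estimated_latency_s(1_200, tokens_per_second=50)
capped = estimated_latency_s(apply_cap(1_200, 300), tokens_per_second=50)
print(uncapped, capped)  # 24.5 6.5
```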

The third is retrieval and data engineering debt. Production RAG means document ingestion, chunking, indexing, permissions replication, hybrid retrieval tuning, citation validation and connector maintenance. Azure, AWS and Google all treat search, document extraction and knowledge-layer design as first-class architecture components.

Quality, security and legal risk

Quality risk still centres on inaccuracy and hallucination, followed by privacy, cybersecurity and IP. McKinsey's State of AI 2025 reports that organisations are increasingly investing in mitigations for inaccuracy, cybersecurity and intellectual-property risk, but most still rely on a small set of controls. Independent reporting on production deployments shows that hallucinations, prompt injection and sensitive-data leakage remain the dominant failure modes when LLMs are exposed to user input or tool use without strong evaluation harnesses.

Standards bodies now treat this as a core management issue. NIST's AI RMF Generative AI Profile is the companion to AI RMF 1.0 for identifying and managing trustworthiness risks in generative-AI systems; SP 800-218A extends secure software development practices to generative AI and dual-use foundation models. The EU AI Act's Article 50 transparency obligations begin to apply on 2 August 2026, requiring providers and deployers to label AI-generated content and disclose interactions with AI systems. Together these frame oversight as a production discipline, not a prompt filter.

Organisational effects

The organisational impact is already material. IBM's 2026 CEO Study reports that 76% of organisations have appointed a Chief AI Officer, with stronger CHRO involvement and broad expectations for reskilling. McKinsey 2025 finds that the rise in AI roles is concentrated in technical and product profiles, while AI compliance and ethics specialists remain rare relative to the scale of risk and compliance work being centralised. That capability gap is one of the better explanations for why so many programmes stall at the governance layer rather than at the model layer.

Vendor lock-in is also deeper than many procurement teams first assumed. Prompt portability is relatively easy. Operational portability is not. The moment an organisation relies on Azure PTUs, Google GSUs, Anthropic-specific cache semantics, Bedrock routing tiers or proprietary SaaS-grounding layers, the switching cost moves from prompt text to runtime economics, observability, approval logic and governance. Anthropic's own documentation notes that its models run across the Claude API, Bedrock, Vertex AI and Microsoft Foundry, with differing regional and pricing treatments.

Shadow IT is no longer just a security issue. It is a procurement and FinOps issue. When secure chat is included in an existing suite, when agent builders are sold as low-code tools, and when model metering looks cheap in isolation, unmanaged sprawl becomes easy. Without central gateways, workspace limits, budgets and audit logs, organisations end up paying for duplicate copilots, duplicate connectors and duplicate review processes.

Governance and the AI FinOps playbook

The evidence points to a fairly clear set of practices.

First, establish a central AI control plane for data governance, model access, observability and policy, while leaving business-unit experimentation partly decentralised. This matches how risk and compliance are typically centralised while adoption and talent are more often hybrid.

Second, treat use-case selection as a portfolio problem. The best first-wave candidates are high-frequency tasks with manageable downside, clear baselines and measurable outputs: support draft responses, repo-grounded coding help, internal knowledge retrieval, meeting notes, document summarisation. Start with a single agent and add tools incrementally.

Third, build AI FinOps into delivery from day one. At minimum:

output caps and verbosity control;
prompt caching for repeated instructions and long context;
batch processing for asynchronous jobs;
routing simple requests to smaller models;
separate budgets for model spend, search/indexing and human review;
chargeback by business unit or workspace.
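Several of these controls fit naturally into a single gateway. The sketch below combines an output cap, a toy routing rule, a shared monthly budget and per-unit chargeback. The Gateway class, the 50-word routing threshold and the budget are hypothetical; the two price points are the GPT-5.4 mini and GPT-5.5 list prices quoted earlier.

```python
from dataclasses import dataclass, field

# USD per million tokens (input, output).
PRICES = {"small": (0.75, 4.50), "large": (5.00, 30.00)}


@dataclass
class Gateway:
    monthly_budget_usd: float
    max_output_tokens: int = 400   # output cap
    simple_prompt_words: int = 50  # toy routing threshold (assumption)
    spent_by_unit: dict = field(default_factory=dict)

    def route(self, prompt: str) -> str:
        # Short prompts go to the cheaper model; real routers use richer signals.
        return "small" if len(prompt.split()) <= self.simple_prompt_words else "large"

    def charge(self, unit: str, model: str, in_tokens: int, out_tokens: int) -> float:
        billable_out = min(out_tokens, self.max_output_tokens)
        in_price, out_price = PRICES[model]
        cost = (in_tokens * in_price + billable_out * out_price) / 1_000_000
        # Refuse the call rather than silently overspend the shared budget.
        if sum(self.spent_by_unit.values()) + cost > self.monthly_budget_usd:
            raise RuntimeError("monthly model budget exceeded")
        # Chargeback: accumulate spend per business unit.
        self.spent_by_unit[unit] = self.spent_by_unit.get(unit, 0.0) + cost
        return cost


gw = Gateway(monthly_budget_usd=500.0)
model = gw.route("Summarise this ticket in two lines")
cost = gw.charge("claims", model, in_tokens=2_000, out_tokens=900)
print(model, round(cost, 4))  # small 0.0033
```

Search, indexing and human-review budgets would sit in separate ledgers; this sketch covers model spend only.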

Fourth, require evidence of value, not just adoption. The right unit of account is the business process: resolution time, merge rate, cycle time, campaign throughput, gross margin, conversion, claims accuracy or working-capital velocity. Vendor-sponsored ROI multiples are useful for ambition; internal ROI should be measured against holdouts, pre/post baselines or randomised roll-outs wherever possible. The GitHub/Accenture study is a good model because it combines telemetry, user surveys and an explicit experimental design.
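A holdout design needs two pieces: a stable way to assign users and a comparison of the process metric between groups. The sketch below uses hash-based assignment and a simple difference in means; the salt, the 20% holdout share and the resolution times are hypothetical, and the result is a point estimate with no significance test attached.

```python
import hashlib
from statistics import mean


def in_holdout(user_id: str, holdout_share: float = 0.2,
               salt: str = "exp-1") -> bool:
    """Deterministic assignment: the same user always lands in the same arm."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return bucket < holdout_share


def relative_uplift(treated: list, holdout: list) -> float:
    """Relative change of the treated mean versus the holdout mean."""
    base = mean(holdout)
    return (mean(treated) - base) / base


# Resolution time in minutes (lower is better).
holdout_times = [11, 10, 12, 11]
treated_times = [8, 9, 7, 8]
print(round(relative_uplift(treated_times, holdout_times), 3))  # -0.273
```

Changing the salt per experiment keeps the same users from always ending up in the holdout.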

Fifth, tier human oversight by risk. External-facing or regulated content should have stronger review, provenance and escalation rules than low-risk internal drafting. NIST's AI RMF Generative AI Profile and SP 800-218A, plus the EU AI Act's Article 50 obligations from 2 August 2026, provide the best public templates for making oversight operational rather than aspirational.
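Tiering is easier to enforce when it is written down as a rule rather than a principle. The mapping below is purely illustrative: the tier names, review levels and labelling flags are my assumptions, not requirements taken from NIST or the EU AI Act.

```python
def review_tier(external_facing: bool, regulated: bool) -> dict:
    """Map a use case to an oversight policy (hypothetical tiers)."""
    if regulated:
        return {"tier": "high", "human_review": "every output",
                "provenance_label": True}
    if external_facing:
        return {"tier": "medium", "human_review": "sampled",
                "provenance_label": True}
    return {"tier": "low", "human_review": "on escalation",
            "provenance_label": False}


print(review_tier(external_facing=False, regulated=False)["tier"])  # low
print(review_tier(external_facing=True, regulated=False)["tier"])   # medium
print(review_tier(external_facing=True, regulated=True)["tier"])    # high
```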

Open questions

Three questions remain structurally hard to answer from public data. The first is exact enterprise token spend by customer: public disclosures remain sparse, so any cross-company token benchmark is still mostly inferential. The second is true all-in ROI: many vendor case studies disclose benefit but not the full cost stack, while many filings disclose cost pressure without a clean process-level counterfactual. The third is durability of uplift: short-run productivity gains are increasingly well evidenced, but the medium-run effects on error rates, organisational design, procurement consolidation and labour mix are still evolving. The Klarna May 2025 reversal and Duolingo's gross-margin swings show the economics can move quickly in both directions as deployment quality and unit costs change.

The highest-confidence conclusion, almost four years in, is already clear. The enterprise impact of the AI fever has been real but uneven. The most successful firms are not the ones with the loudest AI narrative. They are the ones converting model capability into measured workflow change, using architecture to control cost, and governing AI as a production system rather than a demo.

References

McKinsey & Company. The state of AI in 2025: Agents, innovation, and transformation. November 5, 2025. n=1,993.
MIT NANDA. The GenAI Divide: State of AI in Business 2025. July 2025.
Stanford HAI. Artificial Intelligence Index Report 2025.
IBM Institute for Business Value. 2026 CEO Study.
Gartner. Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept By End of 2025. July 29, 2024.
Gartner. Forecasts Worldwide GenAI Spending to Reach $644 Billion in 2025. March 31, 2025.
OpenAI. API pricing, rate-limit and enterprise documentation. platform.openai.com (captured May 2026).
Anthropic. Claude API pricing, prompt caching and tool-use documentation. docs.anthropic.com (captured May 2026).
Google Cloud. Vertex AI and Agent Platform pricing (Flex/Batch, Priority, GSUs).
Microsoft. Azure OpenAI pricing, PTUs, and Microsoft 365 Copilot Business pricing.
Amazon Web Services. Amazon Bedrock pricing across Standard, Flex, Priority and Reserved tiers.
Morgan Stanley. Launch of AI @ Morgan Stanley Debrief; OpenAI customer story on Morgan Stanley evals.
GitHub with Accenture. Enterprise Copilot deployment study.
Bloomberg. Klarna Turns From AI to Real Person Customer Service. May 8, 2025.
Reuters. Reporting on Klarna CEO commentary on the limits of AI cost-cutting in customer service, 2025.
Duolingo. Quarterly gross-margin disclosures, FY2025 to Q1 2026.
Snowflake. FY2026 filing on cost of product revenue.
NIST. AI Risk Management Framework Generative AI Profile and SP 800-218A.
European Commission. AI Act, Article 50 transparency obligations applicable from 2 August 2026.

Research summary

Abstract

This report synthesises evidence available up to 12 May 2026 on the enterprise impact of generative AI, almost four years after the public release of ChatGPT on 30 November 2022. It draws on McKinsey's State of AI 2025 (November 2025, n=1,993), MIT NANDA's The GenAI Divide (July 2025), IBM's 2026 CEO Study, the Stanford HAI AI Index 2025, Gartner forecasts (July 2024 and March 2025), vendor pricing and platform documentation (OpenAI, Anthropic, Google Cloud, Azure, AWS Bedrock), public-company filings and disclosures (Duolingo, Snowflake, Klarna, Morgan Stanley), and standards and regulatory guidance from NIST and the EU AI Act. The central finding is that AI is now near-universal in enterprise use but remains shallowly integrated. McKinsey reports that 88% of organizations regularly use AI in at least one function, yet only about one-third have begun to scale and only ~6% qualify as AI high performers attributing more than 5% of EBIT to AI. MIT NANDA estimates that 95% of generative-AI pilots are still failing to deliver measurable P&L impact. The firms compounding value route cheap models to cheap work, redesign workflows around AI and agents, measure baseline and uplift, and govern AI as a production system rather than a demo.

Research question

Almost four years into the generative-AI era, what has the enterprise actually bought, and where is the measured value, real cost, and operating-model gap that separates the firms compounding returns from the ones still piloting?

Methodology

Synthesis of McKinsey Global Survey on the state of AI (November 2025, n=1,993, fielded June 25-July 29, 2025); MIT NANDA The GenAI Divide: State of AI in Business 2025 (July 2025); IBM Institute for Business Value 2026 CEO Study; Stanford HAI Artificial Intelligence Index Report 2025; Gartner press releases on GenAI forecasts and POC abandonment (July 2024, March 2025); public vendor pricing pages and platform documentation (OpenAI, Anthropic, Google Cloud Vertex/Agent Platform, Azure OpenAI, Amazon Bedrock, Microsoft 365 Copilot) captured May 2026; SEC EDGAR filings and shareholder letters (Duolingo, Snowflake); official enterprise case studies (Morgan Stanley with OpenAI, GitHub with Accenture, Klarna, Google Cloud customer pages); Reuters and Bloomberg reporting on Klarna's May 2025 reversal of AI-only customer service; and standards guidance (NIST AI RMF Generative AI Profile, NIST SP 800-218A). Where vendors use seat pricing, provisioned capacity, or region-specific tiers, that is stated explicitly. Where a public source does not break out a number, it is shown as unspecified.

Key findings

McKinsey State of AI 2025 (Nov 2025, n=1,993): 88% of organizations now regularly use AI in at least one business function, up from 78% a year earlier, but only about one-third have begun to scale AI across the enterprise.
Only ~6% of organizations qualify as McKinsey AI high performers, defined as attributing more than 5% of EBIT and significant value to AI use. Only 39% of respondents attribute any EBIT impact to AI, and most of those say it is less than 5%.
Workflow redesign remains the single strongest leadership predictor of EBIT impact. AI high performers are nearly 3x more likely than peers to have fundamentally redesigned individual workflows around AI.
MIT NANDA, The GenAI Divide (July 2025): despite an estimated $30-40 billion of enterprise spend on generative AI, roughly 95% of corporate generative-AI pilots are failing to generate measurable P&L returns, with the gap concentrated in custom internal builds rather than purchased copilots.
Agentic AI is real but narrow: 23% of McKinsey respondents say their organization is scaling an agentic AI system somewhere, and 39% are experimenting, but in any individual business function no more than 10% are scaling agents.
IBM 2026 CEO Study: 76% of surveyed organizations now have a Chief AI Officer and 64% of CEOs are comfortable making major strategic decisions on AI-generated input, yet only 25% of the workforce uses AI regularly in daily work.
Gartner forecast worldwide GenAI spending of $644 billion in 2025 (+76.4% YoY), while warning that at least 30% of generative-AI projects would be abandoned after proof of concept by end-2025, with typical deployment approaches costing $5M to $20M.
Vendor list prices keep falling for small models but remain structurally asymmetric on output tokens. The most important cost-control levers are now architectural, not negotiational: caching, batching, routing simpler prompts to smaller models, and limiting unnecessary output length.
For many internal text workloads, model metering alone is not the dominant cost line. A 300-seat Microsoft 365 Copilot Business deployment at the annual list price runs roughly $5,400/month before implementation work, already higher than the model meter for many modest internal API workloads.
Klarna, May 2025: CEO Sebastian Siemiatkowski publicly reversed course, telling Bloomberg the company had cut customer-service costs too far with AI, restarted human-agent hiring, and committed to a hybrid model. Morgan Stanley's adviser assistant remained at 98% adoption with the OpenAI-powered Debrief tool now in production. The two firms bookend the 2025-2026 reality: real productivity wins, real ceiling on automation-only roll-outs.
Vendor lock-in has moved from prompt portability to operational portability: Azure PTUs, Google GSUs, Anthropic-specific cache semantics, Bedrock routing tiers and proprietary SaaS-grounding layers all shift the switching cost into runtime economics, observability, and governance.
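
The comparison between the model meter and seat licensing in the findings above reduces to simple arithmetic. The following is a minimal sketch, assuming the knowledge-assistant workload used in this report (55,000 Q&A turns, ~110M input and 33M output tokens per month). The function names are illustrative, and the per-million-token rates are placeholders rather than a quote from any vendor price list; the per-seat figure is simply the $5,400 monthly total divided by 300 seats.

```python
def monthly_token_cost(input_tokens: int, output_tokens: int,
                       usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Metered model cost in USD for one month, priced per million tokens."""
    return (input_tokens / 1e6) * usd_per_m_input + (output_tokens / 1e6) * usd_per_m_output


def monthly_seat_cost(seats: int, usd_per_seat_month: float) -> float:
    """Licence cost in USD for one month of seat-based pricing."""
    return seats * usd_per_seat_month


# Workload shape from the report's knowledge-assistant scenario.
TURNS = 55_000
INPUT_TOKENS = 110_000_000
OUTPUT_TOKENS = 33_000_000

# Derived averages per Q&A turn.
input_per_turn = INPUT_TOKENS / TURNS    # 2,000 tokens
output_per_turn = OUTPUT_TOKENS / TURNS  # 600 tokens

# ASSUMPTION: placeholder rates in USD per million tokens, not a vendor quote.
meter = monthly_token_cost(INPUT_TOKENS, OUTPUT_TOKENS,
                           usd_per_m_input=0.75, usd_per_m_output=4.50)

# Per-seat price implied by the report's own figures ($5,400/month for 300 seats).
implied_seat_price = 5_400 / 300
seats = monthly_seat_cost(300, implied_seat_price)

print(f"tokens per turn: {input_per_turn:.0f} in / {output_per_turn:.0f} out")
print(f"model meter:     ${meter:,.2f} per month")
print(f"seat licences:   ${seats:,.2f} per month (${implied_seat_price:.2f} per seat)")
```

With these placeholder rates the meter comes to $231 per month against $5,400 for seats. The output rate is six times the input rate in this sketch, which is why output length, not prompt length, tends to be the first lever to check.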

Executive implications

Treat AI as an operating-model programme, not a software add-on. Workflow redesign is the strongest predictor of EBIT impact in McKinsey's data; CEO sponsorship without process engineering is the most common failure mode behind MIT NANDA's 95% figure.
Build an AI control plane on day one: data governance, model access, observability, evaluation, policy. Leave business-unit experimentation partly decentralised.
Run AI FinOps from day one: output caps, prompt caching, batch processing for asynchronous jobs, smart routing of simple requests to smaller models, separate budgets for model spend, search/indexing and human review, chargeback by business unit or workspace.
Pick first-wave use cases as a portfolio: high-frequency tasks with manageable downside, clear baselines, and measurable outputs. Start with a single agent and add tools incrementally rather than jumping to multi-agent systems.
Require evidence of value, not just adoption. The right unit of account is the business process: resolution time, merge rate, cycle time, gross margin, conversion. Use holdouts, pre/post baselines, or randomised roll-outs wherever possible.
Tier human oversight by risk. The Klarna 2025 reversal is the cautionary case: efficiency-only deployments hit a customer-experience ceiling. NIST AI RMF Generative AI Profile and SP 800-218A are the best public templates.
Plan for total cost of ownership, not headline token price. Tool schemas, system prompts, retrieved chunks and conversation history all inflate spend. Tokenizer choices and tool-use overheads can swing real spend by 30-50% silently.
Treat AI sovereignty and data residency as procurement criteria, not afterthoughts. The EU AI Act's Article 50 transparency obligations begin to apply on 2 August 2026; data zones, regional endpoints, audit logs and enterprise data-protection features are now material parts of the delivered cost stack.
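
The AI FinOps levers listed above (output caps, routing simple requests to smaller models, chargeback by business unit) can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the tier names, prices, token threshold and the `needs_reasoning` flag are all hypothetical, and a production router would classify requests with something richer than prompt length.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                 # hypothetical tier label, not a real model ID
    usd_per_m_input: float    # ASSUMPTION: placeholder price per million input tokens
    usd_per_m_output: float   # ASSUMPTION: placeholder price per million output tokens
    max_output_tokens: int    # output cap enforced for this tier


SMALL = ModelTier("small-tier", 0.5, 2.0, 300)
LARGE = ModelTier("large-tier", 5.0, 20.0, 1000)


def route(prompt_tokens: int, needs_reasoning: bool, threshold: int = 1500) -> ModelTier:
    """Send short, simple requests to the small tier and everything else to the large one."""
    if needs_reasoning or prompt_tokens > threshold:
        return LARGE
    return SMALL


class Ledger:
    """Accumulates model spend per business unit for chargeback."""

    def __init__(self) -> None:
        self.spend = defaultdict(float)

    def record(self, business_unit: str, tier: ModelTier,
               prompt_tokens: int, output_tokens: int) -> float:
        # The cap is simulated here by truncating billed output; in a real
        # call it would be passed to the provider as a maximum-output setting.
        billed_output = min(output_tokens, tier.max_output_tokens)
        cost = (prompt_tokens * tier.usd_per_m_input
                + billed_output * tier.usd_per_m_output) / 1e6
        self.spend[business_unit] += cost
        return cost


ledger = Ledger()
tier = route(prompt_tokens=1000, needs_reasoning=False)
ledger.record("support", tier, prompt_tokens=1000, output_tokens=5000)
print(tier.name, dict(ledger.spend))
```

Keeping the ledger keyed by business unit is what makes the separate budgets in the recommendation enforceable; search, indexing and human-review spend would need their own ledgers alongside it.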

Limitations

Public disclosures of exact enterprise token volumes remain sparse. Public companies and vendors far more often disclose users, conversations, documents, throughput quotas, or cost and margin effects than raw token totals. Scenario cost tables are computed from public list prices captured in May 2026 and are illustrative, not contractual. Many vendor case studies disclose benefit but not the full cost stack; many filings disclose cost pressure without a clean process-level counterfactual. The MIT NANDA 95% figure is based on a multi-method study of more than 300 publicly disclosed AI initiatives plus interviews and surveys; it should be read as directional rather than as an audited industry census. Survey-based ROI figures, including the Microsoft-sponsored IDC averages, should likewise be treated as directional rather than equivalent to independently audited returns.

References

McKinsey & Company. The state of AI in 2025: Agents, innovation, and transformation. November 5, 2025. n=1,993, fielded June 25-July 29, 2025.
MIT NANDA. The GenAI Divide: State of AI in Business 2025. July 2025.
Stanford HAI. Artificial Intelligence Index Report 2025.
IBM Institute for Business Value. 2026 CEO Study.
Gartner. Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept By End of 2025. July 29, 2024.
Gartner. Forecasts Worldwide GenAI Spending to Reach $644 Billion in 2025. March 31, 2025.
OpenAI. API pricing and rate-limit documentation. platform.openai.com (captured May 2026).
Anthropic. Claude API pricing, prompt caching, and tool-use documentation. docs.anthropic.com (captured May 2026).
Google Cloud. Vertex AI and Agent Platform pricing, including Flex/Batch and Priority tiers and provisioned throughput (GSUs).
Microsoft. Azure OpenAI pricing, PTUs, and Microsoft 365 Copilot Business pricing pages.
Amazon Web Services. Amazon Bedrock pricing across Standard, Flex, Priority and Reserved tiers.
Morgan Stanley with OpenAI. Adviser AI assistant adoption disclosures and AI @ Morgan Stanley Debrief launch (2024-2025).
GitHub with Accenture. Enterprise Copilot deployment study.
Bloomberg. Klarna Turns From AI to Real Person Customer Service. May 8, 2025.
Reuters. Klarna CEO commentary on the limits of AI cost-cutting in customer service, 2025.
Duolingo. Quarterly gross-margin disclosures, FY2025 to Q1 2026.
Snowflake. FY2026 filing on cost of product revenue including GPU and AI-inference workloads.
NIST. AI Risk Management Framework Generative AI Profile (AI RMF GenAI Profile) and SP 800-218A on secure software development for generative AI.
European Commission. AI Act, Article 50 transparency obligations applicable from 2 August 2026.

Key takeaways

The hype-to-impact gap is structural and now well measured. Sponsorship is high, workflow redesign is rare, and only ~6% of organizations attribute more than 5% of EBIT to AI in McKinsey's November 2025 survey.
MIT NANDA, July 2025: roughly 95% of generative-AI pilots are still failing to deliver measurable returns. The gap is concentrated in custom internal builds, not in purchased copilots.
Agentic AI is real but narrow. 23% of organizations are scaling at least one agent somewhere, 39% are experimenting, but no individual function shows more than 10% scaling agents.
What enterprises actually build is narrow: copilots, enterprise search, customer-service bots, code assistants, and bounded back-office agents. RAG and tool use have replaced 'one giant model'.
Token metering is rarely the dominant cost line for modest internal workloads. Seats, search, evaluation, security and human review usually outweigh the model bill until volumes get very large.

Decision brief

Enterprise AI fever becomes value only when a named business owner changes a real workflow and accepts a measurable outcome. Model capability, licenses, pilot count, and employee activity are leading indicators at best; none is proof of operating impact.

Decision criteria

The outcome already has an accountable business owner and baseline.
The workflow can change beyond the demonstration environment.
Value remains after adoption, integration, controls, and operating costs.

Evidence to check

Before-and-after cycle time, quality, revenue, risk, or cost.
Sustained usage by the people who own the workflow.
Net economics after implementation and human review are included.
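
The evidence checks above come down to two calculations: the change in a process metric against a baseline or holdout group, and the value left after every cost line is subtracted. A minimal sketch follows, assuming resolution time in minutes as the metric; all figures are invented for illustration and the function names are not taken from any cited study.

```python
from statistics import mean


def uplift(treated: list[float], holdout: list[float]) -> float:
    """Relative change in a process metric for the AI-assisted group versus the holdout."""
    base = mean(holdout)
    return (mean(treated) - base) / base


def net_monthly_value(gross_saving: float, model_spend: float, seat_spend: float,
                      review_spend: float, implementation_amortised: float) -> float:
    """Value remaining after model, licence, human-review and implementation costs."""
    return gross_saving - (model_spend + seat_spend + review_spend + implementation_amortised)


# ASSUMPTION: illustrative resolution times in minutes, not measured data.
holdout_minutes = [40, 50, 60]   # teams working without the assistant
treated_minutes = [30, 40, 50]   # teams working with the assistant

change = uplift(treated_minutes, holdout_minutes)
print(f"resolution time change vs holdout: {change:+.0%}")

# ASSUMPTION: illustrative monthly figures in USD.
net = net_monthly_value(gross_saving=10_000, model_spend=231, seat_spend=5_400,
                        review_spend=2_000, implementation_amortised=1_500)
print(f"net monthly value: ${net:,.0f}")
```

A negative change is the desired direction for a time metric. In this invented example a 20% faster resolution time still leaves only a thin net margin once seats, review and amortised implementation are counted, which is the pattern the decision criteria are designed to surface.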

What I would do Monday

Choose the three most visible AI initiatives and replace their activity metrics with one business outcome, one owner, one baseline, and one date by which the workflow must change.

Questions executives ask

Is enterprise AI delivering ROI in 2026?

Locally yes, enterprise-wide rarely. McKinsey's State of AI 2025 (November 2025, n=1,993) found that only 39% of respondents attribute any EBIT impact to AI and only ~6% qualify as AI high performers attributing more than 5% of EBIT and significant value. MIT NANDA's GenAI Divide (July 2025) estimates that 95% of generative-AI pilots are failing to deliver measurable returns despite an estimated $30-40 billion of enterprise spend. Business-unit wins are real (Morgan Stanley adviser adoption, GitHub/Accenture developer telemetry, focused customer-service deployments), but firm-wide P&L impact still requires workflow redesign, measurement and governance most organizations have not yet built.

What does an enterprise LLM workload actually cost?

Less than executives assume on the meter, more than they assume on total cost of ownership. For an internal knowledge assistant doing 55,000 Q&A turns per month (~110M input + 33M output tokens), public list prices captured in May 2026 range from roughly $231 on GPT-5.4 mini to $1,650 on Claude 3.5 Sonnet via Bedrock. A 300-seat Microsoft 365 Copilot Business deployment at the annual list price already runs about $5,400 per month, before implementation. The expensive parts usually sit outside the base token price: data connectors, permissions, evaluation harnesses, security reviews, observability, change management and human review.

Why do most generative-AI pilots stall?

Gartner forecasts that at least 30% of generative-AI projects will be abandoned after proof of concept by end-2025; MIT NANDA puts the share returning zero P&L impact at around 95%. The reasons are operational, not technical: poor data quality, inadequate risk controls, escalating costs, unclear business value, missing baselines, and no plan for redeploying time saved. McKinsey 2025 confirms it: workflow redesign is the strongest leadership predictor of EBIT impact, and AI high performers are nearly 3x more likely than peers to have fundamentally redesigned a workflow. Pilot success says almost nothing about production economics.

What is the right operating model for enterprise AI?

A central control plane (data governance, model access, observability, policy, evaluation) combined with partly decentralised business-unit experimentation. Use a portfolio approach to use-case selection: start with high-frequency tasks where the workflow is measurable and the error surface is controlled. Run AI FinOps from day one: output caps, prompt caching, batch processing, routing simple requests to smaller models, separate budgets for model spend, search and human review. Tier human oversight by risk. Klarna's May 2025 reversal is the cautionary case study on going too far on automation alone.

Should we build, buy, or partner for enterprise AI?

All three, sequenced by use-case maturity and switching cost. Buy seat-based copilots for productivity work where vendor-managed grounding is acceptable. Build only where you need proprietary workflow logic, data residency, or model-routing economics that off-the-shelf platforms cannot give you. Partner for edge capabilities (vertical models, evaluation, governance tooling) where the market is moving faster than internal cycles. Be aware that vendor lock-in has moved from prompt portability to operational portability: PTUs, GSUs, cache semantics and routing tiers compound over time. MIT NANDA's 2025 data suggests purchased copilots currently outperform custom internal builds on adoption and ROI.

How do I measure AI impact credibly to my board?

Measure the business process, not the activity. Resolution time, merge rate, cycle time, conversion, gross margin, claims accuracy and working-capital velocity are the right units of account. Use holdouts, pre/post baselines or randomised roll-outs where possible. The GitHub/Accenture study is the strongest public template: it combines telemetry, user surveys and an explicit experimental design. Treat vendor-sponsored ROI multiples as ambition-setting, not as audited returns; treat MIT NANDA's 95%-fail and McKinsey's 6%-high-performer figures as the realistic outer bounds of where most programmes currently sit.

Canonical URL: https://juanbeltran.ch/blog/enterprise-impact-ai-fever