Monthly Archives: May 2026

The AI bill is coming due

By Eric Picard

A story is making the rounds in business circles. AI is being sold to us at investor-subsidized prices because the model labs are losing money on every prompt. When the venture capital subsidy ends, prices will jump several-fold, and any company that bet big on AI will find itself holding an existential bag. Adopt extensively, the argument goes, and you put your business at risk.

It’s not entirely wrong. But it points at the wrong problem, and it misses where the real exposure sits. Let’s work through what’s happening.

The cost curve has been collapsing

The cost to run a model with GPT-4-level intelligence has dropped roughly 50x in three years. At GPT-4’s launch in March 2023, that capability cost about $20 per million tokens. Today it costs $0.40 or less. Pull back further and the picture is even more dramatic: for GPT-3-level intelligence, the cost has fallen roughly 1,000x since 2021, going from around $60 per million tokens to about $0.06. Either way you slice it, this is one of the fastest cost declines in the history of computing — faster than anything Moore’s Law ever produced.

That decline came from four compounding sources, each contributing across multiple model generations: better hardware, better inference software, better model architectures like Mixture of Experts, and quantization techniques that run models at lower numerical precision. Each of these has delivered roughly 2-4x improvements per generation, and they stack — across several generations of each, you get the orders of magnitude we’ve actually seen.

Worth noticing what’s not on that list: Moore’s Law and quantum computing. Quantum has no demonstrated path to running AI inference cheaper than GPUs in this decade, and probably not the next. If your optimism is anchored on quantum, you’re betting on a horse that hasn’t entered the race. The horses running today — inference-optimized silicon, sparse architectures, aggressive quantization, semantic caching, model routing — are real, shipping, and doing the work.

The labs are figuring it out faster than people realize

OpenAI generated about $20 billion in revenue in 2025 and is projected to burn $17 billion in cash in 2026. Their gross margin sits in the 33-48% range depending on whose leak you trust. Those numbers feed the doomer story. But they describe one company’s strategy, not the technology’s economics.

The more informative number is Anthropic’s gross margin trajectory: negative 94% in 2024, roughly 40% in 2025 (revised down from an original 50% projection when inference costs came in 23% higher than expected), and a target of 77% by 2028. Anthropic figured out early they’d rather sell to enterprises than subsidize a billion free consumer users. Roughly 80% of their revenue now comes from business customers, and that mix is what makes the path to 77% margins plausible at all. OpenAI is in a stickier spot because they built a consumer brand they can’t easily walk back from. You can’t tell 850 million free users “actually, that’ll be $30 a month now” without detonating the moat that scale was supposed to create.

When people say “AI is unprofitable,” what they usually mean is “OpenAI’s consumer business is unprofitable.” Different statement.

The real exposure is somewhere else entirely

If frontier model prices double or triple at the top tier, it doesn’t matter much for most businesses. Roughly 90% of real-world AI tasks — summarization, classification, drafting, retrieval, basic analysis — don’t require the smartest model available. They run fine on tier-two cloud models or open-weight models that now lag the frontier by about three months on standard benchmarks. That gap is shrinking. A sensible enterprise with disciplined model routing finds a price hike on Opus or GPT-5 annoying, not existential.

The exposure sits in usage patterns, not pricing.

Cost per token has dropped 50x for GPT-4-level capability and 1,000x for GPT-3-level capability. But agentic workflows (autonomous systems that reason in loops, call tools, and self-correct) consume 5 to 30 times more tokens per task than a single chatbot query. Cheaper inference doesn’t mean lower bills when usage is exploding. This is Jevons paradox in real time. In the 19th century, economist William Stanley Jevons observed that improving steam engine fuel efficiency didn’t reduce coal consumption. It increased it, because cheaper steam power opened up uses that hadn’t been economical before. When something gets cheaper, we use vastly more of it, and total spend goes up.
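A rough way to see the arithmetic: the bill is price per token, times tokens per task, times number of tasks. The sketch below plugs in the 50x price drop and the 5x to 30x token multiplier from above. The task volume growth figures are illustrative assumptions, not data, and the function name is made up for this example.

```python
def bill_ratio(price_drop: float, tokens_per_task_growth: float,
               task_volume_growth: float) -> float:
    """Return new bill divided by old bill.

    bill = price_per_token * tokens_per_task * number_of_tasks,
    so the ratio is (token growth * volume growth) / price drop.
    """
    return tokens_per_task_growth * task_volume_growth / price_drop


PRICE_DROP = 50  # from the text: tokens are roughly 50x cheaper

# Token growth of 5x and 30x comes from the text.
# Volume growth of 1x, 10x, 100x is an assumption for illustration only.
for tokens_growth in (5, 30):
    for volume_growth in (1, 10, 100):
        ratio = bill_ratio(PRICE_DROP, tokens_growth, volume_growth)
        print(f"tokens x{tokens_growth:>2}, task volume x{volume_growth:>3}: "
              f"bill x{ratio:.2f}")
```

Under these assumptions, a single agentic task still costs less than the old chatbot query did, even at 30x the tokens. The bill only grows once task volume rises by more than about 1.7x (50 divided by 30), which is exactly what happens when agents get wired into every workflow.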

The companies blowing past their AI budgets aren’t doing so because the labs ended a subsidy. They’re doing so because they wired Claude Opus or GPT-5 into every step of every agentic workflow without asking whether each step required a frontier model. They got it working with the smartest model available, shipped it, and never went back to optimize.

What this means for business leaders

The right question isn’t “what’s our spend on AI?” It’s “what’s our cost per resolved task, and how does it compare to the human-equivalent cost?”

If you’re spending $4 per AI-resolved customer service ticket and a human agent could resolve it for $3, you have a zombie agent draining your quarterly budget. The wrong response is to panic about AI cost. The right response is to route 80% of those tickets to a small fast model that handles them for ten cents each, and reserve the frontier model for the 20% of cases that genuinely require it.
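The routing math is worth doing once by hand. This minimal sketch uses the figures above and assumes the 20% of tickets that stay on the frontier model keep costing $4 each, which the real numbers may not bear out. The function name is hypothetical.

```python
def blended_cost_per_ticket(share_small: float, cost_small: float,
                            cost_frontier: float) -> float:
    """Average cost per ticket when a share of traffic goes to the small model."""
    return share_small * cost_small + (1 - share_small) * cost_frontier


HUMAN_COST = 3.00     # from the text: human agent cost per ticket
FRONTIER_COST = 4.00  # from the text; assumed unchanged for the hard 20%
SMALL_COST = 0.10     # from the text: "ten cents each"

cost = blended_cost_per_ticket(0.80, SMALL_COST, FRONTIER_COST)
print(f"blended cost per ticket: ${cost:.2f}")  # prints $0.88
print(f"cheaper than the human agent: {cost < HUMAN_COST}")
```

Nearly all of the remaining 88 cents comes from the frontier share, so the escalation rate is the number to watch.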

This discipline has a name: FinOps for AI. It’s the same discipline mature cloud-native companies built ten years ago for AWS, GCP, and Azure spend. Tag every model call. Make the cost visible to the developer making the architectural choice. Build dashboards that show cost per task type, escalation rate to the expensive model, and trend over time.

When developers can see their choices reflected in cost, they make better choices. When they can’t, they don’t.
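What tagging every call might look like in practice, as a sketch rather than a reference implementation. The tier names and per-million-token prices below are placeholders, not any vendor's actual price list, and the record fields are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelCall:
    task_type: str      # the tag, e.g. "support_ticket"
    model_tier: str     # "small" or "frontier" (hypothetical tier names)
    input_tokens: int
    output_tokens: int


# Placeholder prices in dollars per million tokens: (input, output).
PRICE_PER_MILLION = {"small": (0.10, 0.40), "frontier": (5.00, 25.00)}


def call_cost(call: ModelCall) -> float:
    """Dollar cost of one tagged call under the placeholder price table."""
    price_in, price_out = PRICE_PER_MILLION[call.model_tier]
    return (call.input_tokens * price_in
            + call.output_tokens * price_out) / 1_000_000


def summarize(calls: list[ModelCall]) -> dict[str, dict[str, float]]:
    """Per task type: total cost, average cost per call, escalation rate."""
    totals = defaultdict(lambda: {"calls": 0, "frontier": 0, "cost": 0.0})
    for call in calls:
        row = totals[call.task_type]
        row["calls"] += 1
        row["frontier"] += call.model_tier == "frontier"
        row["cost"] += call_cost(call)
    return {
        task: {
            "total_cost": row["cost"],
            "avg_cost_per_call": row["cost"] / row["calls"],
            "escalation_rate": row["frontier"] / row["calls"],
        }
        for task, row in totals.items()
    }


calls = [
    ModelCall("support_ticket", "small", 2000, 500),
    ModelCall("support_ticket", "small", 1800, 400),
    ModelCall("support_ticket", "small", 2200, 600),
    ModelCall("support_ticket", "frontier", 4000, 1000),
    ModelCall("contract_review", "frontier", 20000, 3000),
]
for task, stats in summarize(calls).items():
    print(task, stats)
```

A real setup would also log a timestamp per call so the same summary can be trended over time, and would group calls by task rather than averaging per call.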

The architectural piece matters too. Don’t build a system where every call goes to a frontier model just because frontier was easiest to wire up during the prototype. Build tiered systems. Small fast models for routine work, mid-tier for the harder stuff, frontier only for what genuinely requires it. Add semantic caching so you’re not paying twice to generate the same answer for slightly differently worded queries.
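Here is a toy version of that architecture, under loud assumptions. The model call is a stub, the difficulty score is supplied by the caller (in practice it might come from a cheap classifier), and word overlap stands in for the embedding similarity a real semantic cache would use. All names and thresholds are hypothetical.

```python
def _words(text: str) -> frozenset[str]:
    """Lowercase word set with basic punctuation stripped."""
    cleaned = text.lower().replace("?", " ").replace(",", " ").replace(".", " ")
    return frozenset(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Jaccard word overlap: a crude stand-in for embedding similarity."""
    wa, wb = _words(a), _words(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


class TieredRouter:
    def __init__(self, cache_threshold: float = 0.7):
        self.cache_threshold = cache_threshold
        self.cache: list[tuple[str, str]] = []   # (query, answer) pairs
        self.calls = {"small": 0, "mid": 0, "frontier": 0}
        self.cache_hits = 0

    @staticmethod
    def pick_tier(difficulty: float) -> str:
        """Map an assumed 0-to-1 difficulty score to a model tier."""
        if difficulty < 0.5:
            return "small"
        if difficulty < 0.8:
            return "mid"
        return "frontier"

    def answer(self, query: str, difficulty: float) -> str:
        # Check the cache first so near-duplicate queries cost nothing.
        for cached_query, cached_answer in self.cache:
            if similarity(query, cached_query) >= self.cache_threshold:
                self.cache_hits += 1
                return cached_answer
        tier = self.pick_tier(difficulty)
        self.calls[tier] += 1
        result = f"[{tier}] answer to: {query}"  # stub for a real model call
        self.cache.append((query, result))
        return result


router = TieredRouter()
router.answer("How do I reset my password?", difficulty=0.2)
router.answer("How can I reset my password?", difficulty=0.2)  # served from cache
router.answer("Draft a merger risk analysis", difficulty=0.9)
print(router.calls, "cache hits:", router.cache_hits)
```

The linear scan over the cache is fine for a sketch. At scale the lookup would go through a vector index, and the cache would need expiry so stale answers do not linger.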

This isn’t speculative. The companies running AI profitably already operate this way.

The cost curve is real and compounding fast. The labs are figuring out unit economics in front of us, not collapsing under them. We don’t need quantum or Moore’s Law to keep delivering — we need MoE, quantization, and inference ASICs to keep doing what they’re already doing.

The companies that will get hurt aren’t the ones that adopted AI extensively. They’re the ones that adopted it naively — without the discipline to route workloads, cache aggressively, and meter every model call. The companies that will thrive are treating AI compute the way mature companies have treated cloud infrastructure for a decade: as a real cost center that needs real engineering discipline, not a magic productivity tool that’s free until further notice.

The AI bill is coming due. It’s coming due for the people who didn’t read it.
