The Economics of Bad AI

How the world works doesn’t have a universal answer. Most companies of reasonable size are fully fledged ecosystems with factions, politics, survival dynamics and Darwinian processes, rather than neat input/output boxes waiting to be disrupted by an AI tool.

AIs today tend to live between two not-so-great poles: one of accumulated organizational noise (Notion, Confluence, the usual suspects), the other of hard-coded configs. The middle, which describes the organization and its processes as a living system, is largely missing. The result is predictable: we build or buy systems that look cheap at the surface API layer and are expensive everywhere else.

Setting aside the (important) discussions on social and economic dynamics, a few sources of tension have to be reconciled for AI to be meaningfully useful, and they’re not cleanly separable:

The Noise Generators Argument

Race condition: model improvement at viable price points vs. the market learning to recognize and/or accept slop. Slop saturation is reaching critical mass in some pockets: workslop is even starting to carry a price tag, and it is doing reputational damage to how work colleagues are perceived. Most studies apply to collateral generated with an LLM sidekick. The same issue arises with “AI for X”: where the knowledge is not externalized, evaluating the output requires the same expertise the tool was supposed to replace.

My guess is that the bet fundamentally boils down to this: humans will develop a distaste for sub-mediocre output more slowly than strong models reach a lower cost point and access to externalized knowledge gets solved. Even in our AI-honeymoon moment, where token-price economics look bleak (GPT-5 to GPT-5.5 was a de-facto 8x price hike on input tokens) but are offset by available capital, we can’t offer an Opus-class model buffet for a $50 subscription. Them’s the breaks.

$50 is a prototyping budget, not an operations budget. Suppose you offer agent access to a company, coupled with RAG, intelligent routing, prompt caching, all the good stuff. Let’s put a shopping list together:

| Model | Bucket | Outputs (choose one) |
| --- | --- | --- |
| Gemini 3.1 Flash Lite | $20 / ~22.8M tokens | ~900 document first drafts OR ~4,000 ticket classifications OR ~600 marketing copy variants OR ~300 RAG answers over long context |
| Claude Haiku 4.5 | $7 / ~2.3M tokens | ~700 customer-facing support replies (multi-turn) OR ~200 polished short-form pieces (FAQ-level) |
| Claude Sonnet 4.6 | $15 / ~1.7M tokens | ~20 “polished” end-to-end documents OR ~75 marketing campaign assets OR ~50 escalated support sessions OR ~15 multi-step workflow runs with tool use |
| Claude Opus 4.7 | $5 / ~330K tokens | ~3 substantive autonomous agent runs OR ~10 high-stakes documents (something you’d feel comfortable with as a board memo - even that is a stretch) |
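To make the table easier to sanity-check, here is a minimal sketch that derives the implied price per million tokens and the token budget per deliverable from the bucket figures alone. The helper names are mine, and treating each bucket as a single blended rate (input and output tokens lumped together) is a simplifying assumption rather than how any provider actually bills.

```python
# Buckets copied from the table above: (dollars allocated, approximate tokens bought).
BUCKETS = {
    "Gemini 3.1 Flash Lite": (20.0, 22_800_000),
    "Claude Haiku 4.5": (7.0, 2_300_000),
    "Claude Sonnet 4.6": (15.0, 1_700_000),
    "Claude Opus 4.7": (5.0, 330_000),
}


def blended_price_per_million(dollars: float, tokens: int) -> float:
    """Implied blended price in dollars per one million tokens (assumed flat rate)."""
    return dollars / tokens * 1_000_000


def tokens_per_deliverable(tokens: int, deliverables: int) -> float:
    """Average number of tokens a single deliverable may consume within its bucket."""
    return tokens / deliverables


total_spend = sum(dollars for dollars, _ in BUCKETS.values())

for model, (dollars, tokens) in BUCKETS.items():
    price = blended_price_per_million(dollars, tokens)
    print(f"{model}: ${price:.2f} per 1M tokens")

print(f"Allocated: ${total_spend:.0f} of the $50 budget")
print(f"Opus tokens per agent run: {tokens_per_deliverable(330_000, 3):,.0f}")
print(f"Sonnet tokens per polished document: {tokens_per_deliverable(1_700_000, 20):,.0f}")
```

Under these assumptions the buckets add up to $47, and the implied Opus rate is roughly 17 times the implied Gemini Flash Lite rate, which is the gap the rest of this section is arguing about.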

I’ve (accidentally, admittedly) used Opus to slice and dice OpenRouter’s pricing and to reason with it to get this table to a not-hand-wavy stage. Gemini would have cost 17 cents to do it, Sonnet $2, Opus closer to $4 (maybe $2 with system-prompt caching). Now imagine you’re operating a business and are trying to do real work, rather than napkin math for a blog. I just burned 7% of that toy-allocation in a short design conversation.

Now you probably don’t need an Opus-class model to generate customer replies, but you also can’t expect polished documents to be generated by Haiku or Gemini Flash-class models either. And, honestly, if you do, please re-think your standards. Anyone who has used these tools knows very well the quality drop can be quite brutal.

This table will, of course, age like milk. Whether prices will drop, or whether we will see incremental updates with high price jumps, depends on who you ask. The only point I’m trying to make is that, to conceivably up the game in deliverable and decision-making quality, at least one of these things needs to happen:

  1. Models get dramatically cheaper at the same quality level or higher.
  2. Domain knowledge gets externalized (context graph, md files, clever prompting) and cheap models can do more with structured context.
  3. Human-in-the-loop (HITL) review becomes ubiquitous for any work where we’re unsure if we’re even operating on the Pareto line. Without externalized knowledge, this is an unbounded human and expertise cost that is barely talked about.

I’m interested in #2 and #3 next.

The Neural Sandwich: Electric Boogaloo

The two-job split I’ve discussed before, where we assemble the context deterministically and reason over it neurally, becomes an economic argument as much as an architectural one.

The RAG version of this is that the model gets a bag of retrieved chunks, some relevant, some not, with temporal pinning that is largely accidental (in that we are hoping for an IDF-like approach to surface dates). Afterwards, the model has to figure out how all of this relates in an inference call whose audit trail stops at attribution: a stale chunk that’s cited is still, technically, accurate.

Loosely speaking: let’s not overpay for a Sonnet-class model to reason its way over RAG to an output that is logically anchored in first principles and contextual information. Let’s instead pay for a cheap model to reason over hard work that was already done cheaply and mechanically.

A substrate that captures temporal validity, entity resolution, authority chains and de-facto workflows (rather than conflicting de-jure information in Confluence) is a meaningful step towards making small models viable and elevating heavier models.
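As a rough illustration of the "assemble deterministically, reason neurally" split, the sketch below filters facts by temporal validity, resolves an entity alias, and orders what survives by authority before anything reaches a model. Everything in it is hypothetical: the `Fact` fields, the alias table, the integer authority scale and the sample data are stand-ins I made up, not a description of any real substrate.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    entity: str                # canonical entity id
    statement: str
    valid_from: date
    valid_to: Optional[date]   # None means the fact is still in force
    authority: int             # higher means a more authoritative source (assumed scale)


# Hypothetical alias table standing in for real entity resolution.
ALIASES = {"acme corp": "acme", "acme inc.": "acme"}


def resolve(name: str) -> str:
    """Map a surface name to a canonical entity id."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def assemble_context(facts: list[Fact], entity_name: str, as_of: date) -> str:
    """Keep only facts valid on `as_of`, ordered by authority, then recency."""
    entity = resolve(entity_name)
    live = [
        f for f in facts
        if f.entity == entity
        and f.valid_from <= as_of
        and (f.valid_to is None or as_of < f.valid_to)
    ]
    live.sort(key=lambda f: (f.authority, f.valid_from), reverse=True)
    return "\n".join(
        f"- {f.statement} (since {f.valid_from.isoformat()})" for f in live
    )


# Made-up sample data: one superseded fact, two live ones, one for another entity.
FACTS = [
    Fact("acme", "Payment terms are net 30", date(2023, 1, 1), date(2024, 6, 1), 2),
    Fact("acme", "Payment terms are net 45", date(2024, 6, 1), None, 2),
    Fact("acme", "Invoices above 10k need controller sign-off", date(2022, 3, 1), None, 3),
    Fact("globex", "Payment terms are net 60", date(2021, 1, 1), None, 2),
]

context = assemble_context(FACTS, "ACME Corp", as_of=date(2025, 1, 15))
print(context)
```

The superseded "net 30" fact never reaches the prompt, so the model is not asked to notice that it is stale. That filtering is the part done "for cheap, mechanically"; the remaining reasoning over two short lines is what a small model would be paid for.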

A reasonably sharp pushback could be: “the info is already in these systems (maybe?), so imperfect reasoning over it is better than trying to solve the cold start problem and, then, applying simpler reasoning models over the context”. This certainly has real force, but it stops being valid once you go past low-stakes work and discovery work: basically, anything where the cost of being wrong is roughly equal to the cost of being slow. “Just let the agent try” is a great way to get to grips with a messy reality before earning the right to build the substrate, but in my view it stays squarely in the PoC field.

Unbounded Human-in-the-Loop (HITL)

This is a big one. Every escalation bears the full cost of human intervention: context-switching, time, and the expertise required for evaluation. And, barring luxuries such as hard-coded thresholds against some variable, escalating is itself a cognitive decision.

The cost categories haven’t caught up with this new world either: if the AI’s labor input is five minutes of a controller’s attention twelve times per day, it’s a diffused cost that is difficult to get under control. When finance functions catch on that unbounded operating cost has been chosen over bounded capital cost, there is a non-zero chance that HITL escalations will become a key KPI in AI adoption.
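To put a number on that diffused cost, here is a back-of-the-envelope sketch of the controller example. The five minutes and twelve escalations per day come from the text; the 220 working days, the $90 loaded hourly rate and the optional ten minutes of refocus time per interruption are illustrative assumptions of mine, not measured figures.

```python
def annual_hitl_hours(
    minutes_per_escalation: float,
    escalations_per_day: int,
    working_days: int,
    refocus_minutes: float = 0.0,
) -> float:
    """Hours per year spent on escalations, optionally including refocus time."""
    minutes_per_day = (minutes_per_escalation + refocus_minutes) * escalations_per_day
    return minutes_per_day * working_days / 60


def annual_hitl_cost(hours: float, loaded_hourly_rate: float) -> float:
    """Yearly cost of those hours at a fully loaded hourly rate."""
    return hours * loaded_hourly_rate


WORKING_DAYS = 220   # assumption: working days per year
LOADED_RATE = 90.0   # assumption: fully loaded hourly cost of a controller, in dollars

base_hours = annual_hitl_hours(5, 12, WORKING_DAYS)
refocus_hours = annual_hitl_hours(5, 12, WORKING_DAYS, refocus_minutes=10)

print(f"Attention only: {base_hours:.0f} h/year, ${annual_hitl_cost(base_hours, LOADED_RATE):,.0f}")
print(f"With refocus time: {refocus_hours:.0f} h/year, ${annual_hitl_cost(refocus_hours, LOADED_RATE):,.0f}")
```

With these inputs, one hour a day of "just five minutes" becomes 220 hours a year for a single person, and three times that once the assumed refocus time is counted. None of it shows up as a line item.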

I’ve talked about collapsing organizational and process hotspots into lower-dimensional machine-readable representations. That sounds dandy, but the ugly truth of the matter is that separating “crap” from “meaningful work” is not analytically tractable: the two can’t be sorted like a list.

Inevitably, the cost of maintaining a substrate will be weighed against the cost of token spend and occasional failure for a more opportunistic agentic system that “just tries things out”. That is not the best comparison: I think the right cost to balance is HITL vs. the cost of the substrate, and the two are, interestingly, inversely related.

An organization’s ability to transform HITL escalations into an incremental enrichment step in its decision substrate will drive the “cost” of the next escalation down. The fact that there is no data ownership framework to manage this, and that all HITL escalations land in a vendor’s moat, robbing companies of optionality in the long term, should give everyone pause.
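A toy version of that enrichment loop, purely as a sketch: each human decision is written back to a store the organization owns, so an identical case does not escalate twice. The `DecisionSubstrate` class, the exact-match case keys and the `controller` stand-in are all hypothetical. Real cases rarely match exactly, and a real substrate would also need generalization, expiry and review of stored decisions.

```python
from dataclasses import dataclass, field
from typing import Callable, Hashable


@dataclass
class DecisionSubstrate:
    """Hypothetical store of past human decisions, keyed by case signature."""
    rules: dict = field(default_factory=dict)
    escalations: int = 0

    def decide(self, case_key: Hashable, ask_human: Callable[[Hashable], str]) -> str:
        # Known case: answer from the substrate, no human involved.
        if case_key in self.rules:
            return self.rules[case_key]
        # Unknown case: escalate once, then record the decision for next time.
        self.escalations += 1
        decision = ask_human(case_key)
        self.rules[case_key] = decision
        return decision


def controller(case_key) -> str:
    """Stand-in for the human reviewer; the policy here is made up."""
    kind, _detail = case_key
    return "approve" if kind == "refund" else "reject"


cases = [
    ("refund", "over_500"),
    ("refund", "over_500"),
    ("vendor", "new"),
    ("refund", "over_500"),
    ("vendor", "new"),
]

substrate = DecisionSubstrate()
decisions = [substrate.decide(case, controller) for case in cases]

print(f"{len(cases)} cases, {substrate.escalations} escalations")
print(decisions)
```

Five cases produce two escalations here instead of five. The ownership question is simply where `rules` lives: in the company's own systems or inside a vendor's product.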

Putting It All Together

“The Economics of Bad AI” isn’t a question of “AI bad” or “models are bad”. It’s about how the economics of doing it well are structured to favor the wrong investments. Token spend goes on a dashboard; substrate maintenance is tedious, invisible work that goes on someone’s calendar; HITL cost is mostly invisible too and goes on someone’s cognitive load instead, conveniently absent from any COGS line.

We can’t afford Opus everywhere. But with appropriate work on the substrate, and with ownership of the HITL infrastructure so that escalation wisdom gets absorbed, small (and cheap) models become viable and heavy models can stretch their legs in novel situations.

As it stands, the perverse outcome is that we are incentivized to create systems that look promising on the P&L side, deliver subpar results and burn value through the floor.