▸ Original research

We looked at 30 client AI rollouts. Here's what actually broke.

QuantAimLabs Research · Published Nov 12, 2025 · 8 min read

Over the last 14 months we've worked alongside 30 engineering teams shipping AI features into production. We collected anonymized failure-mode data from incident reviews, retros, and the ugly Slack threads that live in #ai-incidents. Here's what we found.

The headline finding is unsurprising once you've lived through it: the model is almost never the problem. When AI features blow up in production, it's because of the boring scaffolding around them. The failures cluster into a small number of categories, and most of them are preventable with engineering work that has nothing to do with prompts.

The dataset

30 organizations, ranging from 25-person Series A startups to Fortune 500 platform teams. Industries: SaaS, fintech, healthtech, e-commerce, devtools, and two government suppliers. Each rollout had at least one user-facing AI feature live with real traffic for ≥30 days. We logged every incident at severity 2 and above, plus all silent regressions caught by evals or replays. Total: 187 incidents.

What broke, ranked

31%

Prompt regressions caused by silent vendor model updates

A "minor" model version bump shifted output structure. Downstream parsers broke. Caught by users, not us. Almost universally preventable with pinned model versions and CI evals on every prompt change.
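To make "pinned model versions and CI evals" concrete, here is a minimal sketch of a structure-only eval gate. It is an illustration, not a description of any client's setup: the model identifier, the eval cases, and the `call_model` callable are all hypothetical stand-ins for whatever your provider client exposes.

```python
import json
from typing import Callable

# Hypothetical pinned identifier: a dated snapshot rather than a floating alias.
PINNED_MODEL = "example-model-2025-06-01"

# Hypothetical eval cases: each lists the keys the downstream parser relies on.
EVAL_CASES = [
    {"prompt": "Summarize ticket 1", "required_keys": {"summary", "priority"}},
    {"prompt": "Summarize ticket 2", "required_keys": {"summary", "priority"}},
]


def structure_ok(raw_output: str, required_keys: set[str]) -> bool:
    """True if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def pass_rate(call_model: Callable[[str, str], str]) -> float:
    """Run every eval case against the pinned model; return the pass fraction."""
    passed = sum(
        structure_ok(call_model(PINNED_MODEL, case["prompt"]), case["required_keys"])
        for case in EVAL_CASES
    )
    return passed / len(EVAL_CASES)


def gate(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """CI gate: returns False (fail the build) when the pass rate regresses."""
    return current >= baseline - tolerance
```

In CI, the job would compute `pass_rate(...)` on every prompt or model change and exit non-zero when `gate(...)` returns False. Checking structure only is a deliberate simplification; quality evals would sit alongside it.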

22%

Cost runaway from a single bad prompt or feature flag

One feature shipped without per-request token budgets. Quote from one CTO: "We found out at the credit card." Median duration of overspend before detection: 9 days. Cost guardrails are a 4-hour engineering task most teams keep deferring.
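A per-request token budget can be as small as the sketch below. The names (`TokenBudget`, `enforce_budget`) are hypothetical, and the four-characters-per-token estimate is a rough assumption; a real gateway would use the provider's own tokenizer.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised instead of sending a request that would exceed its budget."""


@dataclass(frozen=True)
class TokenBudget:
    max_input_tokens: int
    max_output_tokens: int


def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def enforce_budget(prompt: str, requested_output_tokens: int, budget: TokenBudget) -> int:
    """Return the output-token cap to send, or raise if the prompt is too large."""
    prompt_tokens = estimate_tokens(prompt)
    if prompt_tokens > budget.max_input_tokens:
        raise BudgetExceeded(
            f"prompt is ~{prompt_tokens} tokens, limit is {budget.max_input_tokens}"
        )
    # Never let a caller ask for more output than the budget allows.
    return min(requested_output_tokens, budget.max_output_tokens)
```

Calling `enforce_budget` before every provider request means an oversized prompt is rejected before it is billed, and the output cap is clamped even when a caller asks for more.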

14%

Retrieval (RAG) returning stale or wrong context

Embeddings rebuilt monthly while source data changed daily. Or the chunking strategy didn't survive a doc-format migration. Pure ops hygiene problem. Owned by no one in 9 of 14 cases.
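One way to catch the "re-built monthly, changed daily" gap is a scheduled freshness check. The sketch below assumes you can obtain two timestamp maps, one from the source system and one from the index; both maps and the `stale_documents` name are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_documents(
    source_updated: dict[str, datetime],
    indexed_at: dict[str, datetime],
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return IDs whose source changed after the last embedding run
    (or that were never embedded) and whose change is older than max_lag."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for doc_id, updated in source_updated.items():
        embedded = indexed_at.get(doc_id)
        changed_since_index = embedded is None or updated > embedded
        if changed_since_index and now - updated > max_lag:
            stale.append(doc_id)
    return sorted(stale)
```

A non-empty result is the signal to page the pipeline's owner or trigger a re-index, which turns the freshness target into something that can actually be violated and noticed.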

11%

Tool/MCP server timing out under real concurrency

Tool calls tested at 1 rps, deployed at 50. No connection pooling, no circuit breakers. The model handled it; the boring HTTP layer didn't.
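For readers who haven't written one, a circuit breaker around a tool call looks roughly like this. It is a single-threaded sketch with hypothetical names and thresholds; under real concurrency the counters would need a lock.

```python
import time


class CircuitOpen(Exception):
    """Raised when calls are being rejected without touching the tool."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("tool temporarily disabled")
            # Cool-down elapsed: half-open. One more failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Wrapping each tool client in its own breaker means a tool that is timing out gets rejected quickly instead of tying up every concurrent request behind it.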

Sensitive data leaked into logs or third-party vendors

Production prompts contained PII because nobody redacted at the gateway. Discovered during routine audit, not by users, but it's the failure mode that ends careers.
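A gateway redaction pass can start as a handful of regular expressions. The patterns below are illustrative assumptions (email, US-style SSN, 13 to 16 digit card numbers), not a complete PII taxonomy, and regex matching alone will miss names, addresses, and free-form identifiers.

```python
import re

# Illustrative patterns only; extend or replace for your own data.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label before the text leaves the app."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Contact jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111 on file"))
# prints: Contact [EMAIL], SSN [SSN], card [CARD] on file
```

Applying the same function to both outbound prompts and log lines keeps the two leak paths named in this category behind one chokepoint.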

Agent loops without proper iteration limits

Agent gets stuck retrying the same broken tool call. We saw one rack up $4,200 in 6 hours before the budget alarm fired (which existed only because we'd installed it the prior week).
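An iteration limit plus a repeated-call check is a few lines of control flow. In this sketch, `next_action` and `execute_tool` are hypothetical callables standing in for the model call and the tool dispatcher, and the action format is an assumption made for illustration.

```python
import json


class AgentHalted(Exception):
    """Raised when the loop is stopped by a guard rather than a final answer."""


def run_agent(next_action, execute_tool, max_steps=8, max_repeats=2):
    """next_action(history) returns {"tool": name, "args": {...}} or {"final": text}."""
    history = []
    repeats = {}
    for _ in range(max_steps):
        action = next_action(history)
        if "final" in action:
            return action["final"]
        # Identical tool name and arguments count as the same call.
        key = json.dumps(action, sort_keys=True)
        repeats[key] = repeats.get(key, 0) + 1
        if repeats[key] > max_repeats:
            raise AgentHalted(f"same tool call requested {repeats[key]} times")
        try:
            result = execute_tool(action["tool"], action["args"])
        except Exception as exc:
            result = f"error: {exc}"  # surface the failure to the model
        history.append((action, result))
    raise AgentHalted(f"no final answer after {max_steps} steps")
```

The two limits cover different failures: `max_repeats` stops the agent that retries one broken call, while `max_steps` stops the agent that wanders through many different calls without finishing.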

Actual model quality issues

Real quality regression: the model got worse at the task. Notable for being the rarest cause, despite being the one most teams plan for.

Other

Miscellaneous: provider outages, certificate rotation breakage, an intern.

The pattern

Add up the top three categories (31% + 22% + 14%) and 67% of incidents are operational. They'd happen with any technology, but AI amplifies them because the systems are non-deterministic, expensive per-call, and frequently customer-visible.

The teams in our dataset that had the fewest incidents weren't using better models. They were doing four things consistently:

Pinning model versions and running CI evals on every prompt change. Boring. Targets the category behind 31% of incidents.
Per-request cost tracing and per-feature budgets. Cheap to implement, cuts the second-most-common failure category to nearly zero.
Treating retrieval as a real data pipeline with owners, freshness SLOs, and recall metrics. Not "we'll re-index when someone notices."
Hardening tool/MCP servers like real services: timeouts, retries with backoff, idempotency keys, circuit breakers.
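The last habit in the list, retries with backoff and idempotency keys, can be sketched as follows. The `send` callable and its `(payload, idempotency_key)` signature are hypothetical; the point is that one key is generated per logical request and reused on every attempt so the server can deduplicate.

```python
import random
import time
import uuid


def call_with_retries(send, payload, max_attempts=4, base_delay=0.2,
                      max_delay=5.0, sleep=time.sleep):
    """Retry transient failures with capped exponential backoff and full jitter."""
    idempotency_key = str(uuid.uuid4())  # same key for every attempt
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload, idempotency_key)
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: let the caller (or a breaker) decide
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
```

Only timeouts and connection errors are retried here. Retrying on every exception would also repeat requests that failed for permanent reasons such as validation errors.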

That's the playbook. None of it requires fancy frameworks. All of it requires engineering rigor that the AI-feature-velocity narrative actively works against.

What we'd do tomorrow if we were starting fresh

If we were briefing an engineering team starting a new AI feature in production, the order would be:

Write 30 evaluation cases before writing the prompt. Run them in CI. Fail the build on regression.
Instrument cost per request before instrumenting anything else. You'll need it within 30 days regardless.
Pin the model version. Treat upgrades like dependency upgrades — test, stage, roll forward.
Put a redaction layer between your app and any LLM provider. Even non-regulated data deserves it.
Add per-feature token budgets with circuit breakers. Default to "fail closed" when exceeded.
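The cost-tracing and fail-closed steps in the list above can share one small component. This is a sketch under stated assumptions: the class name, the in-memory counters, and the per-1k-token prices passed in by the caller are all hypothetical, and a production version would persist spend and reset it per billing window.

```python
from collections import defaultdict


class FeatureBudgetExceeded(Exception):
    """Raised when a feature has no budget configured or has used it up."""


class FeatureBudgets:
    def __init__(self, limits_usd: dict[str, float]):
        self.limits = dict(limits_usd)
        self.spent = defaultdict(float)

    def check(self, feature: str) -> None:
        """Fail closed: unknown features and exhausted budgets are both denied."""
        limit = self.limits.get(feature)
        if limit is None or self.spent[feature] >= limit:
            raise FeatureBudgetExceeded(feature)

    def record(self, feature: str, input_tokens: int, output_tokens: int,
               price_in_per_1k: float, price_out_per_1k: float) -> float:
        """Add one request's cost to the feature's running total and return it."""
        cost = (input_tokens / 1000 * price_in_per_1k
                + output_tokens / 1000 * price_out_per_1k)
        self.spent[feature] += cost
        return cost
```

The intended call order is `check` before the provider request and `record` after it, using the token counts the provider reports. Denying features that have no configured limit is what makes the default fail closed rather than fail open.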

Caveats

30 is not a huge sample. Selection bias is real — these are organizations that hired us, which skews toward teams already aware of the gap. We've intentionally not named clients or quoted exact figures from any single engagement; aggregates and percentages only. The full methodology is available on request.
