We looked at 30 client AI rollouts. Here's what actually broke.
Over the last 14 months we've worked alongside 30 engineering teams shipping AI features into production. We collected anonymized failure-mode data from incident reviews, retros, and the ugly Slack threads that live in #ai-incidents. Here's what we found.
The headline finding is unsurprising once you've lived through it: the model is almost never the problem. When AI features blow up in production, it's because of the boring scaffolding around them. The failures cluster into a small number of categories, and most of them are preventable with engineering work that has nothing to do with prompts.
The dataset
30 organizations, ranging from 25-person Series A startups to Fortune 500 platform teams. Industries: SaaS, fintech, healthtech, e-commerce, devtools, two government suppliers. Each rollout had at least one user-facing AI feature live with real traffic for ≥30 days. We logged every incident severity-2 and above, plus all silent regressions caught by evals or replays. Total: 187 incidents.
What broke, ranked
The pattern
Add up the top three categories: 67% of incidents are operational. They'd happen with any technology, but AI amplifies them because the systems are non-deterministic, expensive per-call, and frequently customer-visible.
The teams in our dataset that had the fewest incidents weren't using better models. They were doing four things consistently:
- Pinning model versions and running CI evals on every prompt change. Boring. Prevented 31% of incidents.
- Per-request cost tracing and per-feature budgets. Cheap to implement, cuts the second-most-common failure category to nearly zero.
- Treating retrieval as a real data pipeline with owners, freshness SLOs, and recall metrics. Not "we'll re-index when someone notices."
- Hardening tool/MCP servers like real services: timeouts, retries with backoff, idempotency keys, circuit breakers.
That's the playbook. None of it requires fancy frameworks. All of it requires engineering rigor that the AI-feature-velocity narrative actively works against.
What we'd do tomorrow if we were starting fresh
If we were briefing an engineering team starting a new AI feature in production, the order would be:
- Write 30 evaluation cases before writing the prompt. Run them in CI. Fail the build on regression.
- Instrument cost per request before instrumenting anything else. You'll need it within 30 days regardless.
- Pin the model version. Treat upgrades like dependency upgrades — test, stage, roll forward.
- Put a redaction layer between your app and any LLM provider. Even non-regulated data deserves it.
- Add per-feature token budgets with circuit-breakers. Default to "fail closed" when exceeded.
Caveats
30 is not a huge sample. Selection bias is real — these are organizations that hired us, which skews toward teams already aware of the gap. We've intentionally not named clients or quoted exact figures from any single engagement; aggregates and percentages only. The full methodology is available on request.
Want the prioritized version of this for your team? Take our 3-minute readiness quiz — the output is built directly on this dataset. Or talk to us about a written assessment.