Shipping AI features without betting the company on them

AI features are easy to demo and hard to make reliable. The boring approach treats the model as one unreliable dependency inside a system you still control.

Shubham Somani 3 min read

Every team we talk to wants an AI feature, and most of them can build a convincing demo in an afternoon. The gap between that demo and something you’d put in front of paying customers is the whole job — and it’s mostly boring engineering.

Here’s how we think about it.

Treat the model as an unreliable dependency

The mistake is treating the model like a function that returns the right answer. It isn’t. It’s a network call to a third party that is occasionally slow, occasionally down, occasionally wrong with total confidence, and occasionally more expensive than you budgeted.

You already know how to handle dependencies like that. Timeouts. Retries with backoff. Circuit breakers. Fallbacks. Caching. A budget. The fact that the dependency is a language model doesn’t change the playbook — it just makes the playbook mandatory instead of optional.
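
As a rough illustration, here is a minimal sketch of the timeout, retry, and fallback part of that playbook. `call_model`, `ModelError`, and the default values are hypothetical stand-ins for whatever client and limits you actually use; circuit breaking and caching are left out to keep it short.

```python
import random
import time
from typing import Callable


class ModelError(Exception):
    """Hypothetical error a model client raises on timeout or failure."""


def call_with_retries(
    call_model: Callable[[str, float], str],  # assumed signature: (prompt, timeout_s) -> text
    prompt: str,
    fallback: str,
    timeout_s: float = 5.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the model a few times, backing off between attempts, then fall back."""
    for attempt in range(max_attempts):
        try:
            return call_model(prompt, timeout_s)
        except ModelError:
            if attempt == max_attempts - 1:
                break  # out of attempts; use the fallback
            # Exponential backoff with jitter: the base delay doubles each attempt.
            delay = base_delay_s * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return fallback
```

The caller always gets a string back: either the model's answer or the fallback you chose in advance. Passing `sleep` as a parameter is only there so tests can skip the real waiting.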

Constrain the surface area

A free-text box wired straight to a model is the hardest possible thing to make reliable, because the input space and output space are both infinite. Boring AI features shrink both:

  • Structured output. Ask for JSON against a schema and validate it. If it doesn’t parse, you retry or fall back — you never ship an unvalidated model response into your system.
  • A bounded action set. Don’t let the model do just anything. Let it choose from a list of actions you’ve already made safe. The model picks; your code executes.
  • A human in the loop where the cost of wrong is high. Drafting an email is fine to automate. Sending money is not. Put a person on the steps where a confident wrong answer is expensive.
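
The three constraints above can be sketched together in a few lines. This is an illustration only: the action names, handlers, and return strings are hypothetical, and the hand-rolled checks stand in for whatever schema validator you prefer.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    handler: Callable[[dict], str]
    required_args: frozenset
    needs_approval: bool  # True where a confident wrong answer is expensive


# Hypothetical allowlist: the only things the model is allowed to pick.
ACTIONS = {
    "draft_email": Action(lambda a: f"draft to {a['to']}", frozenset({"to"}), False),
    "issue_refund": Action(lambda a: f"refund of {a['amount']}", frozenset({"amount"}), True),
}


def parse_action(raw: str) -> Optional[tuple]:
    """Return (action_name, args) if the model output is valid, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    name, args = data.get("action"), data.get("args")
    if not isinstance(name, str) or name not in ACTIONS:
        return None
    if not isinstance(args, dict) or not ACTIONS[name].required_args.issubset(args):
        return None
    return name, args


def execute(raw: str, approved_by_human: bool = False) -> str:
    """The model picks; this code decides whether and how to run it."""
    parsed = parse_action(raw)
    if parsed is None:
        return "fallback: invalid model output"
    name, args = parsed
    action = ACTIONS[name]
    if action.needs_approval and not approved_by_human:
        return f"queued for review: {name}"
    return action.handler(args)
```

In this sketch, a response naming an action outside `ACTIONS` is treated exactly like unparseable output, so the model can never reach code you haven't explicitly listed.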

Evaluate like it’s software, because it is

You wouldn’t ship a payments change without tests. AI features get the same treatment, adapted:

  • A small eval set of real inputs with known-good outputs, run in CI. When you change the prompt or swap the model, you find out immediately if quality dropped.
  • Logging of inputs and outputs (scrubbed of anything sensitive) so you can see what’s actually happening in production, not what you hoped.
  • A regression habit: every bad output a user reports becomes a new case in the eval set. Your test suite gets smarter every time something goes wrong.
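
A minimal sketch of what such an eval set might look like, assuming the feature is a classifier exposed as a hypothetical `classify` function. The cases and the 90% threshold are made up for illustration; real cases would come from your own logs.

```python
from typing import Callable

# Hypothetical eval cases: real inputs paired with known-good outputs.
# Every bad output a user reports gets appended here as a new case.
EVAL_CASES = [
    {"input": "Where is my order?", "expected": "order_status"},
    {"input": "I want my money back", "expected": "refund_request"},
]


def run_evals(classify: Callable[[str], str], cases: list) -> tuple:
    """Return (pass_rate, failing_cases) for the given cases."""
    if not cases:
        raise ValueError("eval set is empty")
    failures = [c for c in cases if classify(c["input"]) != c["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


def check_quality(
    classify: Callable[[str], str],
    cases: list = EVAL_CASES,
    min_pass_rate: float = 0.9,
) -> None:
    """Fail loudly (for example, from a CI test) when quality drops below the bar."""
    pass_rate, failures = run_evals(classify, cases)
    assert pass_rate >= min_pass_rate, f"pass rate {pass_rate:.0%}, failures: {failures}"
```

Calling `check_quality` from a test that runs on every prompt or model change is one way to wire this into CI. Exact-match comparison only suits outputs with a single right answer; free-form text would need a looser scoring function.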

A prompt without an eval set is a vibe, not a feature.

Plan for the model to change under you

The model you launch on will be deprecated. Prices will change. A new version will behave subtly differently. So:

  • Keep the model name and prompt in config, not scattered through the code.
  • Put the provider behind a thin interface so swapping vendors takes a day, not a quarter.
  • Track cost per request as a first-class metric. AI features have a unit economics problem hiding inside them, and you want to see it before finance does.
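
One possible shape for this, sketched under assumptions: `ModelConfig`, `Completion`, `Provider`, and the per-token pricing fields are hypothetical names, not any vendor's API. Each real provider would get a small adapter that implements `complete`.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical fields; in practice these load from a config file or environment.
    model_name: str
    prompt_template: str  # must contain an {input} placeholder
    usd_per_1k_input_tokens: float
    usd_per_1k_output_tokens: float


@dataclass(frozen=True)
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


class Provider(Protocol):
    """The thin interface: one adapter per vendor implements this."""

    def complete(self, model_name: str, prompt: str) -> Completion: ...


def run(provider: Provider, config: ModelConfig, user_input: str) -> tuple:
    """Return (text, cost_usd) so cost can be recorded on every request."""
    prompt = config.prompt_template.format(input=user_input)
    result = provider.complete(config.model_name, prompt)
    cost_usd = (
        result.input_tokens * config.usd_per_1k_input_tokens
        + result.output_tokens * config.usd_per_1k_output_tokens
    ) / 1000
    return result.text, cost_usd
```

Because the calling code only sees `Provider` and `ModelConfig`, changing the model or vendor becomes a config change plus one new adapter, and the cost figure comes back with every call, ready to be emitted as a metric.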

The boring AI checklist

Before an AI feature ships, it has:

  • Timeouts, retries, and a graceful fallback for when the model is slow or down.
  • Structured, schema-validated output — never raw text straight into your system.
  • A bounded set of actions, with a human gate on anything irreversible or expensive.
  • An eval set running in CI.
  • Logging and a cost-per-request metric.
  • The provider behind an interface you can swap.

We’re genuinely excited about what these models can do. That’s exactly why we wrap them in boring, unglamorous scaffolding — so the exciting part can fail safely, cheaply, and visibly, instead of taking the product down with it.