May 12, 2026 · 10 min read

Production over demos: shipping LLM features that survive real users

A prompt that works in a demo is a hypothesis. A feature that holds up under real users is an engineered system. The gap between the two is concrete and closeable — here is how.

It has never been easier to build something that looks like an AI product. A well-crafted system prompt, a capable frontier model, and an afternoon's work will produce a demo that makes people lean in. The trouble starts the moment real users arrive — with inputs you didn't anticipate, expectations you didn't set, and edge cases no demo ever exercises.

The distance between a convincing demo and a dependable feature is where most AI projects quietly fail. It isn't a gap you can close by using a more powerful model or rewriting the prompt one more time. It's a gap you close with the same engineering discipline you apply to any other production system: rigorous measurement, controlled rollout, continuous monitoring, and a serious plan for when things go wrong.

A demo proves a possibility; a feature makes a promise

A demo shows that the system can produce a good answer once, under conditions you control. A feature promises that it will produce an acceptable answer for the vast majority of real inputs, at a cost and latency budget the product can afford, across the full distribution of users — including the ones who phrase things oddly, attach unexpected file formats, or push the system into territory you never imagined.

Those are fundamentally different claims, and only the second one is worth deploying. Treating the LLM like any other external dependency helps recalibrate expectations: you wouldn't ship a payments integration you'd tested exactly once by hand. The same standard applies here. The model is a dependency with a probabilistic output contract, and your job is to build a system whose behavior is acceptable even when the model's output is imperfect.

The practical implication is that the first thing a feature needs is not a better prompt — it's a definition of acceptable. What does 'good enough' look like for this task? What outputs are actively harmful? Where is the cliff edge? Without that definition you have no basis for measurement, and without measurement you have no engineering.

Build the evaluation harness before the feature

The most valuable thing you can build before writing a single line of product code is a representative evaluation set. It does not need to be comprehensive; it needs to be representative. Forty well-chosen examples that span the realistic input distribution, each annotated with what a good output looks like, will usually catch more regressions than four thousand synthetic examples that cluster around easy cases.

The right judge depends on the task. For structured extraction, exact match or schema validation is fast and unambiguous. For summarization or free-form generation, a rubric evaluated by a second model (used consistently, with a documented prompt) gives you a score you can track over time. For high-stakes cases, a human spot-check on a fixed sample provides a ground truth that nothing else can fully substitute. Most real features need a combination: a fast automated pass that runs on every change, and a slower human review that runs on significant changes or when automated scores move unexpectedly.
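As a rough illustration of the fast automated pass, here is a minimal Python sketch of two cheap judges: a normalized exact match and a JSON schema check. The function names and the field names in the usage note are assumptions made for this example, not part of any particular framework.

```python
import json


def exact_match(expected: str, actual: str) -> bool:
    """Judge for extraction-style tasks: equality after trimming and case folding."""
    return expected.strip().casefold() == actual.strip().casefold()


def schema_valid(raw: str, required: dict[str, type]) -> bool:
    """Judge for structured output: a JSON object carrying the required typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # A missing key yields None, which fails every isinstance check used here.
    return all(isinstance(data.get(key), kind) for key, kind in required.items())


def pass_rate(results: list[bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(results) / len(results) if results else 0.0
```

For a hypothetical invoice extractor, `schema_valid(output, {"vendor": str, "total": float})` would reject both free-text replies and objects with a missing field. A rubric scored by a second model or a human spot-check would sit alongside these checks rather than replace them.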

What matters most is that evaluation runs automatically and its results are versioned alongside the code. A quality score that lives in someone's head, or in a spreadsheet disconnected from the repository, isn't an engineering artifact — it's an opinion. The moment you attach a number to a commit, you've turned prompt engineering into software engineering.
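One way to attach a number to a commit is a small harness that returns a report keyed by the commit hash. The sketch below assumes a `generate` callable standing in for the real model call and a `judge` callable of your choosing; `EvalCase`, `run_eval`, and `report_line` are illustrative names.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    case_id: str
    input_text: str
    expected: str


def run_eval(
    cases: list[EvalCase],
    generate: Callable[[str], str],     # assumption: wraps the real model call
    judge: Callable[[str, str], bool],  # (expected, actual) -> pass or fail
    commit: str,                        # e.g. the output of `git rev-parse HEAD`
) -> dict:
    """Score every case and return a report tied to the commit under test."""
    failures = [
        case.case_id
        for case in cases
        if not judge(case.expected, generate(case.input_text))
    ]
    passed = len(cases) - len(failures)
    return {
        "commit": commit,
        "total": len(cases),
        "passed": passed,
        "score": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


def report_line(report: dict) -> str:
    """One JSON line per run, suitable for appending to a file kept in the repo."""
    return json.dumps(report, sort_keys=True)
```

Running this in CI and committing the resulting line next to the change is one simple way to keep the score and the code in the same history.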

Prompt management is configuration management

Prompts are not documentation. They are executable configuration that directly determines the behavior of a production system. Changing a system prompt in a notebook and redeploying without a diff, a review, or an eval run is the equivalent of editing an nginx config directly on a production server: it works until it doesn't, and when it doesn't you have no record of what changed.

Version every prompt with the codebase. Require that any prompt change be accompanied by an eval run that shows whether the change improved, degraded, or had no measurable effect on the target metrics. Make the history auditable. This sounds bureaucratic until the first time a prompt change causes a subtle regression that only shows up three weeks later when a user files a support ticket — at which point you'll wish you had a git blame for prompts.
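A minimal sketch of what that gate could look like, assuming eval scores between 0 and 1 are already available for the baseline and the candidate prompt. The tolerance value and the function names are illustrative choices, not a standard.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """Short, stable identifier for a prompt, useful in logs and eval reports."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def classify_change(baseline: float, candidate: float, tolerance: float = 0.01) -> str:
    """Label a prompt change by its effect on the target metric.

    `tolerance` is an assumed noise floor; differences inside it are treated
    as no measurable effect rather than as a win or a regression.
    """
    delta = candidate - baseline
    if delta > tolerance:
        return "improved"
    if delta < -tolerance:
        return "degraded"
    return "no measurable effect"
```

Recording the fingerprint alongside the verdict gives each entry in the history a concrete prompt to point back to.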

This also forces a useful discipline: you can no longer claim a prompt change is 'obviously better' without evidence. The eval set is the arbiter, and its verdict is a number, not a feeling.

Observability is not optional

In production you need to answer 'what happened to this specific request?' quickly, without reproducing it. That requires tracing each inference call end-to-end: the exact prompt sent (after template resolution and context injection), the retrieved documents or tool outputs that shaped it, the raw model response before any post-processing, the latency at each stage, and the cost. Anything less and you're debugging by guesswork.
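The fields listed above map fairly directly onto a trace record. The following is a sketch, not a tracing library: `InferenceTrace` and `timed` are hypothetical names, and a real system would likely ship these records to whatever tracing backend it already uses.

```python
import json
import time
from contextlib import contextmanager
from dataclasses import asdict, dataclass, field


@dataclass
class InferenceTrace:
    request_id: str
    resolved_prompt: str           # exact prompt after templating and context injection
    context_items: list[str]       # retrieved documents or tool outputs
    raw_response: str = ""         # model output before any post-processing
    stage_latency_ms: dict[str, float] = field(default_factory=dict)
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

    def to_json(self) -> str:
        """Serialize the trace as one structured log line."""
        return json.dumps(asdict(self), sort_keys=True)


@contextmanager
def timed(trace: InferenceTrace, stage: str):
    """Record the wall-clock duration of one pipeline stage on the trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        trace.stage_latency_ms[stage] = (time.perf_counter() - start) * 1000.0
```

Wrapping each stage in `with timed(trace, "retrieval"):` and logging `trace.to_json()` once per request keeps the whole story of that request in a single line.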

Structured traces also make patterns visible at scale. If your extraction feature starts failing on a new document format that a supplier started using last Tuesday, you want to catch that from a spike in error rates, not from a trickle of user complaints. Aggregate the traces, track the distributions, and set alerting thresholds before you need them.
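A rolling error-rate check is one simple form such a threshold could take. This sketch keeps the last N outcomes in memory; the window size, threshold, and minimum sample count are placeholder values to tune, and `ErrorRateMonitor` is an illustrative name.

```python
from collections import deque


class ErrorRateMonitor:
    """Track recent request outcomes and flag when failures exceed a threshold."""

    def __init__(self, window: int = 200, threshold: float = 0.05, min_samples: int = 50):
        self.outcomes: deque[bool] = deque(maxlen=window)  # True means success
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True when an alert should fire."""
        self.outcomes.append(ok)
        # Avoid alerting on a handful of requests right after startup.
        enough_data = len(self.outcomes) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold
```

In practice the same logic usually lives in a metrics system rather than in application memory, but the shape of the rule is the same.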

Cost and latency belong on the same dashboard as quality because all three are constraints that interact. A prompt change that improves quality scores by five percent but doubles token consumption per request may not be a net win depending on the traffic volume and the margin structure. You can only make that tradeoff consciously if you're measuring all three simultaneously.
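To make that tradeoff concrete, here is a small worked calculation. Every number in it is hypothetical: one million requests a month, a blended price of $0.002 per thousand tokens, and a pass rate that moves from 0.80 to 0.84 (a five percent relative gain) while tokens per request double.

```python
def monthly_cost_usd(requests: int, tokens_per_request: int, usd_per_1k_tokens: float) -> float:
    """Total token spend for a month of traffic."""
    return requests * tokens_per_request / 1000 * usd_per_1k_tokens


def cost_per_acceptable_answer(total_cost_usd: float, requests: int, pass_rate: float) -> float:
    """Spend divided by the number of answers that met the quality bar."""
    return total_cost_usd / (requests * pass_rate)


# Hypothetical traffic and pricing, chosen only to illustrate the arithmetic.
REQUESTS = 1_000_000
USD_PER_1K_TOKENS = 0.002

cost_before = monthly_cost_usd(REQUESTS, 1_500, USD_PER_1K_TOKENS)
cost_after = monthly_cost_usd(REQUESTS, 3_000, USD_PER_1K_TOKENS)

unit_before = cost_per_acceptable_answer(cost_before, REQUESTS, 0.80)
unit_after = cost_per_acceptable_answer(cost_after, REQUESTS, 0.84)
```

Under these assumptions monthly spend goes from about $3,000 to about $6,000, and the cost of each acceptable answer nearly doubles. Whether 40,000 additional acceptable answers are worth the extra spend is a product decision, but it is one you can only make with all three numbers in view.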

Gradual rollout and graceful degradation

Release AI features the way you would release any change with uncertain variance: behind a flag, to a small percentage of traffic, with a rollback plan and a clear criterion for what triggers it. This isn't excessive caution — it's how you get real-world signal without betting the product on a single deploy.
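If you do not already have a feature-flag service, percentage rollout can be sketched with deterministic hashing, so a given user stays in or out of the cohort across requests. The feature name, bucket granularity, and rollback margin below are assumptions for illustration.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 10000) for this feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def in_rollout(user_id: str, feature: str, percent: float) -> bool:
    """True if the user falls inside the first `percent` percent of buckets."""
    return rollout_bucket(user_id, feature) < percent * 100


def should_roll_back(control_score: float, treatment_score: float, max_drop: float = 0.02) -> bool:
    """Rollback criterion: treatment quality fell more than `max_drop` below control."""
    return control_score - treatment_score > max_drop
```

Because buckets are stable, raising the percentage only adds users; nobody who already has the feature loses it when the rollout widens.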

Design for degradation before you design for success. What does the feature do when the model returns a malformed response? When latency spikes to 15 seconds? When the context window fills up mid-document? If the answer is 'it crashes' or 'it silently returns wrong data,' those are the things to fix before launch, not after. An AI feature that fails loudly and falls back gracefully is far more trustworthy than one that works most of the time and occasionally corrupts state.
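The sketch below shows one possible shape for that fallback path: a timeout around the model call, strict parsing of the response, and an explicit reason whenever the fallback is used. `call_model` is a hypothetical client function, and the `summary` key is an assumed output schema.

```python
import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def call_with_fallback(
    call_model: Callable[[str], str],  # hypothetical model client
    prompt: str,
    fallback: dict,
    required_keys: tuple[str, ...] = ("summary",),
    timeout_s: float = 15.0,
) -> dict:
    """Return parsed model output, or the fallback plus the reason it was used."""

    def degraded(reason: str) -> dict:
        return {"ok": False, "data": fallback, "reason": reason}

    pool = ThreadPoolExecutor(max_workers=1)
    try:
        raw = pool.submit(call_model, prompt).result(timeout=timeout_s)
    except FutureTimeout:
        return degraded("timeout")
    except Exception as exc:  # the model client itself failed
        return degraded(f"model_error:{type(exc).__name__}")
    finally:
        # Do not block on a slow call; the worker thread finishes in the background.
        pool.shutdown(wait=False)

    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return degraded("malformed_json")
    if not isinstance(data, dict) or any(key not in data for key in required_keys):
        return degraded("schema_mismatch")
    return {"ok": True, "data": data, "reason": None}
```

The caller always receives the same shape, so the UI can render the fallback and the `reason` can be logged and counted. Note that a thread-based timeout stops waiting but does not cancel the underlying request; a client with native timeout support would be preferable where available.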

For actions that have real-world consequences (sending an email, writing to a database, placing an order), require explicit human confirmation even when the model appears confident. The cost of an unnecessary confirmation step is almost always lower than the cost of an unrecoverable automated action. Keep humans in the loop where the stakes justify it, and make that a deliberate architectural choice rather than something you retrofit after the first incident.
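Architecturally, this can be as small as a gate that every model-proposed action passes through. In the sketch below, the action kinds and the `confirm` callback are assumptions; in a real product `confirm` would be a UI prompt or an approval queue.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    kind: str                    # e.g. "send_email" (illustrative action names)
    description: str             # shown to the human who approves or rejects
    execute: Callable[[], str]   # performs the side effect when called


# Assumed list of action kinds that must never run without a human saying yes.
CONSEQUENTIAL_KINDS = {"send_email", "write_database", "place_order"}


def perform(action: ProposedAction, confirm: Callable[[ProposedAction], bool]) -> str:
    """Run the action, asking for confirmation first if it has real-world consequences."""
    if action.kind in CONSEQUENTIAL_KINDS and not confirm(action):
        return "cancelled"
    return action.execute()
```

The gate deliberately ignores any confidence signal from the model: whether to ask is decided by the kind of action, not by how sure the model sounds.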

The feedback loop that keeps it from drifting

Production AI systems drift in two ways. The model changes underneath you: providers update hosted models, sometimes without announcement, and subtle behaviors shift. The input distribution changes as users learn what the system does and probe its edges. Neither drift is dramatic on any given day, and both are invisible without monitoring.

Close the loop: log failures (especially user-facing corrections and explicit negative signals), triage them weekly, and fold the most instructive ones back into the eval set. A living eval set that grows toward the real distribution is the main thing separating a system that ages well from one that degrades silently until something breaks badly enough to notice.
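The folding step can be mechanical once triage has picked the cases. This sketch assumes each triaged failure is a dict with `input` and `corrected_output` keys and that eval cases carry an `id`; those field names, and the weekly cap, are illustrative.

```python
import hashlib


def fold_failures(eval_set: list[dict], failures: list[dict], limit: int = 10) -> list[dict]:
    """Return a new eval set with up to `limit` triaged failures appended.

    Cases are keyed by a hash of their input, so folding the same failure
    twice does not create duplicates. The original list is left untouched.
    """
    seen = {case["id"] for case in eval_set}
    merged = list(eval_set)
    added = 0
    for failure in failures:
        if added >= limit:
            break
        case_id = hashlib.sha256(failure["input"].encode("utf-8")).hexdigest()[:12]
        if case_id in seen:
            continue
        merged.append({
            "id": case_id,
            "input": failure["input"],
            "expected": failure["corrected_output"],
            "source": "production",
        })
        seen.add(case_id)
        added += 1
    return merged
```

Capping how many cases are added per cycle keeps the set from being swamped by one noisy week, and the `source` field makes it easy to see how much of the eval set now reflects real traffic.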

None of this is unique to AI. It is the normal discipline of running reliable software — applied early, seriously, to a component that happens to be probabilistic. The teams that treat that component as magic tend to end up with systems that are magical exactly as long as the demo conditions hold.

GenAI · Engineering
