Farid Fadaie

AI Voice Agent Architecture: What I Learned Building the Same Agent Three Times

I built the same production voice agent three times. Same requirements, same telephony stack, same speech models available — three fundamentally different architectures. The first one collapsed under its own coupling: every bug fix broke something else. The second one was controllable and correct, and it destroyed the one thing a voice agent exists for: responding like a human, immediately. The third one is the only one that survived contact with real callers.

This post is the architecture write-up I wish I’d read before spending months and real money finding out the hard way. It’s about one question that turns out to decide everything: who owns the next turn?

The problem shape

The agent answers real phone calls. It has to do three things at once, and they pull in different directions:

Every architecture below is a different answer to where those three responsibilities live.

Architecture 1: the server-side orchestrator

Architecture 1: every caller turn fans out to parallel classifiers; shared flags couple everything to everything.

The first build treated the speech model as a mouthpiece. The server owned everything: each caller utterance was transcribed and fanned out to a battery of parallel classifiers — one deciding the call category, one watching for a provider name, one watching for end-of-call intent, one detecting reference-only questions, plus hand-maintained regex banks for language detection (including phonetic transliterations of “can you speak X to me?” in half a dozen scripts). A merge layer combined their outputs into a “turn analysis,” and a policy engine picked what to say next, often from scripted prompts.
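To make the shape concrete, here is a minimal sketch of that fan-out, assuming four toy classifiers and a small policy function. Every name here (`TurnAnalysis`, `analyze_turn`, `pick_prompt`) and all the keyword checks are hypothetical stand-ins for illustration, not the original implementation; the real classifiers would each be a model call.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnAnalysis:
    category: Optional[str]
    provider_name: Optional[str]
    wants_to_end: bool
    reference_only: bool


# Toy stand-ins (assumptions): keyword checks in place of real classifiers.
async def classify_category(text: str) -> Optional[str]:
    return "billing" if "bill" in text.lower() else None


async def detect_provider(text: str) -> Optional[str]:
    return "Dr. Lee" if "dr. lee" in text.lower() else None


async def detect_end_of_call(text: str) -> bool:
    return "goodbye" in text.lower()


async def detect_reference_only(text: str) -> bool:
    return text.lower().startswith("what are your hours")


async def analyze_turn(text: str) -> TurnAnalysis:
    # Fan out: every classifier sees the same transcript concurrently.
    category, provider, ending, reference = await asyncio.gather(
        classify_category(text),
        detect_provider(text),
        detect_end_of_call(text),
        detect_reference_only(text),
    )
    # Merge layer: independent outputs become one "turn analysis".
    return TurnAnalysis(category, provider, ending, reference)


def pick_prompt(analysis: TurnAnalysis) -> str:
    # Policy engine: the order of these checks is itself hidden coupling.
    if analysis.wants_to_end:
        return "Thanks for calling. Goodbye!"
    if analysis.reference_only:
        return "We're open nine to five on weekdays."
    if analysis.category is None:
        return "How can I help you today?"
    return f"Sure, I can help with {analysis.category}."
```

Even in this toy version the coupling shows. For the utterance "I have a question about my bill. Goodbye!", the category classifier reports `billing` and the end-of-call detector reports `True`, and which one wins depends entirely on the order of the checks in `pick_prompt`.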

Why it’s tempting

Why it collapsed

The deep lesson from architecture 1: distributed decision-making without a single owner is the bug. It doesn’t matter how clean each component is; if N components can each initiate speech, you ship race conditions to people’s ears.
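The "N components can each initiate speech" failure can be sketched in a few lines. This is an illustrative simplification, assuming a hypothetical `TurnOwner` that lets exactly one component claim each turn; it runs sequentially, whereas a real system would face the same problem under concurrency.

```python
from typing import Dict, List, Optional, Tuple


class TurnOwner:
    """Grants the right to speak on a given turn to exactly one component."""

    def __init__(self) -> None:
        self._claimed: Dict[int, str] = {}  # turn_id -> component that owns it

    def claim(self, turn_id: int, component: str) -> bool:
        # setdefault keeps the first claimant; later claimants get False.
        return self._claimed.setdefault(turn_id, component) == component


def run_turn(
    owner: Optional[TurnOwner],
    turn_id: int,
    wants_to_speak: List[Tuple[str, str]],
) -> List[str]:
    """Return the utterances that actually reach the caller on this turn."""
    spoken = []
    for component, utterance in wants_to_speak:
        # With no owner, every component that wants to speak does speak.
        if owner is None or owner.claim(turn_id, component):
            spoken.append(utterance)
    return spoken


candidates = [
    ("policy_engine", "Thanks for calling. Goodbye!"),
    ("end_of_call_detector", "Okay, bye now!"),
]
unowned = run_turn(None, 1, candidates)       # both goodbyes play
owned = run_turn(TurnOwner(), 1, candidates)  # only the first claimant speaks
```

Without an owner the caller hears two goodbyes back to back; with one, the second component's attempt is refused. The sketch only shows the ownership check, not how to choose the right owner.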

Architecture 2: server-gated turn-taking

Architecture 2: one decider, full control — and a server round-trip of dead air on every single turn.

The obvious fix for “too many deciders” is one decider. The second build made the server the single owner of every turn: the realtime speech model was muted by default and spoke only when the server explicitly triggered a response with instructions. Caller speaks → transcript → server classifies → server decides what should be said → server instructs the model to say it.

What it bought

What it cost

The deep lesson from architecture 2: in voice, latency is not a performance metric — it’s the product. An architecture that adds a server round-trip to every turn has already failed, no matter how correct it is. The latency budget for feeling human is roughly half a second; classification pipelines don’t fit in it.
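The budget argument is plain arithmetic. In the sketch below, the 500 ms budget comes from the "roughly half a second" figure above, but every per-stage latency is a made-up placeholder, not a measurement from any of the three builds. The stage names simply follow the gated loop described earlier.

```python
from typing import Dict

TURN_BUDGET_MS = 500  # "roughly half a second"

# Placeholder numbers (assumptions), chosen only to illustrate the sum.
GATED_TURN_STAGES_MS: Dict[str, int] = {
    "final transcript": 300,
    "server classification": 400,
    "server decision": 50,
    "instruct model, first audio": 300,
}


def dead_air_ms(stages: Dict[str, int]) -> int:
    """Stages in a gated turn run in sequence, so their latencies add."""
    return sum(stages.values())


def budget_overrun_ms(stages: Dict[str, int], budget_ms: int = TURN_BUDGET_MS) -> int:
    """How far past the budget the turn lands (0 if it fits)."""
    return max(0, dead_air_ms(stages) - budget_ms)


total = dead_air_ms(GATED_TURN_STAGES_MS)
overrun = budget_overrun_ms(GATED_TURN_STAGES_MS)
```

With these placeholder values the gated turn takes 1050 ms, 550 ms over budget. The exact numbers will differ in any real system; the point is that sequential stages add, so each extra hop on the critical path is paid on every turn.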

Architecture 3: the model owns the conversation

Architecture 3: conversation, guarantees, and the record each have exactly one owner.

The third build inverts the relationship, and it’s the one that works. The realtime speech-to-speech model owns the conversation outright: it hears the caller, decides what to say, and says it — immediately, with no server gate. Everything the server cares about is expressed through three separate channels, each with exactly one owner:

What it costs (honestly)

The principle underneath

Looking back across all three builds, nearly every user-audible failure had the same root: two subsystems both believed they owned the next turn. Two goodbyes played back to back. A scripted prompt talking over a live answer. A re-asked question the caller had already answered. Different symptoms, one disease.

So the architecture question for voice agents isn’t “how do I control the model?” It’s an ownership ledger:

Rules I now hold as defaults for any voice agent:

  1. Never put a server round-trip between the caller and the agent’s voice. If a check can’t run async or after the call, it doesn’t belong in the turn loop.
  2. One owner per decision. If two code paths can both initiate speech or both end the call, you’ve already shipped the bug.
  3. Verify, never trust. Model (and policy) claims about state are hypotheses until checked against the actual data at a single shared gate.
  4. Backfill-only, evidence-grounded recovery. Let the transcript be the source of truth for the record — after the call, with a quote required for every extracted value.
  5. No language-specific code. If you’re writing a regex for how people say something, you’re re-implementing the model, badly, one language at a time.
  6. Log decisions, not just events. Every gate verdict, every recovered field, every intervention — one greppable line each. Voice bugs are reconstructed, not reproduced.
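Rules 3 and 6 combine naturally into one small function. The following is a minimal sketch, assuming a hypothetical record with two required fields and a single `verify_claim` gate; the field names, claim label, and log format are illustrative assumptions, not the production schema.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List, Tuple

log = logging.getLogger("voice.decisions")

# Hypothetical required fields for a completed intake.
REQUIRED_FIELDS: Tuple[str, ...] = ("caller_name", "callback_number")


@dataclass
class GateVerdict:
    claim: str
    passed: bool
    missing: List[str]


def verify_claim(
    claim: str,
    record: Dict[str, str],
    required: Tuple[str, ...] = REQUIRED_FIELDS,
) -> GateVerdict:
    """Treat a claim about state as a hypothesis; check it against the record."""
    # A field counts as present only if it holds a non-empty value.
    missing = [name for name in required if not record.get(name)]
    verdict = GateVerdict(claim=claim, passed=not missing, missing=missing)
    # One greppable line per verdict, whether it passes or fails.
    log.info(
        "gate claim=%s verdict=%s missing=%s",
        claim,
        "pass" if verdict.passed else "fail",
        ",".join(missing) or "-",
    )
    return verdict
```

If the model reports that intake is complete while `callback_number` is still empty, `verify_claim("intake_complete", record)` returns a failing verdict naming the missing field and leaves one log line behind for later reconstruction.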

The third architecture isn’t cleverer than the first two — it’s humbler. It stops fighting the realtime model for the steering wheel, narrows the server to the few promises only a server can keep, and moves correctness to the one place that has the whole conversation and no clock: after the call.
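As one illustration of what "after the call" correctness might look like, here is a sketch of the backfill-only, evidence-grounded rule. It assumes a hypothetical `Extraction` type produced by some post-call extractor (not shown) and uses a plain substring check as the evidence test; both are assumptions for illustration.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List

log = logging.getLogger("voice.backfill")


@dataclass
class Extraction:
    name: str   # record field the extractor wants to fill
    value: str  # proposed value
    quote: str  # verbatim transcript span offered as evidence


def backfill(
    record: Dict[str, str],
    extractions: List[Extraction],
    transcript: str,
) -> Dict[str, str]:
    """Fill empty fields only, and only when the quote appears in the transcript."""
    updated = dict(record)  # never mutate the live record in place
    for item in extractions:
        if updated.get(item.name):
            outcome = "skip_already_set"    # backfill-only: never overwrite
        elif not item.quote or item.quote not in transcript:
            outcome = "reject_no_evidence"  # no verbatim quote, no value
        else:
            updated[item.name] = item.value
            outcome = "recovered"
        # One line per decision, including the rejections.
        log.info("backfill field=%s outcome=%s", item.name, outcome)
    return updated
```

A value captured during the call is never overwritten, a proposed value whose quote does not appear in the transcript is dropped, and only a quoted value for an empty field is written.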

Next in this series: how we test this thing — simulated callers, LLM-as-a-judge scoring, and the statistical discipline you need when your test subject is non-deterministic.
