Voice AI demos are easy. Every founder who has seen a GPT-4o real-time demo has thought "I need that in my product." The gap between that demo and a voice agent that handles 500 concurrent calls without complaint is where the real engineering lives.

We have shipped voice agents for healthcare intake, SaaS support, and outbound sales qualification. Here is what the demos do not show you.

Problem 1: Latency

The voice AI pipeline has three stages, each adding latency: speech-to-text (STT), LLM inference, and text-to-speech (TTS). In a real-time conversation, users notice any response delay over 700ms. Most naive pipeline implementations land at 2-4 seconds. That is not a voice agent -- that is an awkward phone menu.

How to get under 700ms

  • Streaming TTS: Do not wait for the full LLM response before starting speech synthesis. Stream the LLM output token by token into your TTS engine. ElevenLabs, Deepgram, and Cartesia all support streaming TTS. The first word starts playing within 200-300ms while the model is still generating the rest.
  • Smaller, faster LLMs for simple turns: Not every conversational turn needs GPT-4o. A "yes, I can help with that, what is your account number?" response can come from a smaller, faster model. Reserve the large model for turns that require actual reasoning.
  • STT with real-time transcription: Deepgram Nova-2 and AssemblyAI Universal both support streaming transcription that starts returning text as the user speaks, rather than waiting for silence detection to trigger batch processing.
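To make the streaming TTS handoff concrete, here is a minimal sketch: buffer LLM tokens and flush a chunk to the TTS engine at each sentence (or clause) boundary, so synthesis starts long before generation finishes. The `send_to_tts` callback and the token stream are hypothetical stand-ins for your real TTS client and LLM API.

```python
# Minimal sketch: flush buffered LLM tokens to TTS at sentence/clause
# boundaries so speech starts before the full response exists.
# `send_to_tts` is a hypothetical stand-in for a real TTS client call.

SENTENCE_ENDINGS = (".", "!", "?", ",")  # flushing on commas gives earlier first audio

def stream_tokens_to_tts(token_stream, send_to_tts):
    """Buffer incoming tokens; emit a chunk to TTS at each boundary."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            send_to_tts(buffer.strip())
            buffer = ""
    if buffer.strip():  # flush whatever remains when the stream ends
        send_to_tts(buffer.strip())

# Usage with a fake token stream:
chunks = []
tokens = ["Sure", ",", " I can", " help.", " What is", " your", " account", " number?"]
stream_tokens_to_tts(iter(tokens), chunks.append)
```

The first chunk ("Sure,") reaches the TTS engine after two tokens, which is what gets first-word latency down while the model is still generating.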

Our production target is 400-600ms first-word latency. With the techniques above, that target is consistently achievable.
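The small-model-for-simple-turns idea can be sketched as a cheap router that classifies the turn before picking a model, so you only pay large-model latency when the turn actually needs reasoning. The model tiers and the keyword heuristic below are illustrative assumptions, not a prescription.

```python
# Illustrative sketch: route simple conversational turns to a fast
# model and reserve the large model for turns that need reasoning.
# The tier names and keyword heuristic are assumptions for the demo.

REASONING_MARKERS = ("why", "explain", "compare", "recommend", "refund")

def pick_model(user_turn: str) -> str:
    """Return a model tier for this turn based on a cheap heuristic."""
    turn = user_turn.lower()
    if len(turn.split()) > 20 or any(m in turn for m in REASONING_MARKERS):
        return "large"   # e.g. Claude Sonnet for turns needing reasoning
    return "small"       # e.g. Claude Haiku for acknowledgements, slot-filling
```

In production you would likely replace the keyword check with a tiny classifier, but the shape is the same: decide the tier before the expensive call, not after.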

Problem 2: Hallucinations in Voice Context

Hallucinations in a chatbot are annoying. In a voice agent they are a customer service disaster: users cannot re-read the response, so an incorrect statement is hard to catch and harder to contest.

Mitigation strategies for production voice agents:

  • Strict grounding: Every factual claim the agent makes should be grounded in retrieved context (RAG) or structured data from your database. The system prompt should explicitly prohibit improvisation on facts: "If the answer is not in your context, say you will need to check and follow up."
  • Confidence thresholds: If retrieval similarity scores fall below a threshold, route to a human agent rather than attempting an answer. Define this threshold empirically using your golden test set.
  • Verbal hedging patterns: Train the system prompt to include natural verbal hedges for uncertain information: "Based on what I have on file..." or "I want to confirm this before I commit..." These set appropriate user expectations and open a natural correction loop.
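The confidence-threshold rule above can be sketched as a simple gate on retrieval scores. The 0.75 cutoff here is a placeholder; as noted, you would set it empirically against your golden test set.

```python
# Sketch of a retrieval-confidence gate: answer from context only when
# the best retrieval score clears a threshold, otherwise escalate.
# The 0.75 default is a placeholder to be tuned on a golden test set.

ESCALATE = "escalate_to_human"
ANSWER = "answer_from_context"

def route_on_confidence(retrieval_scores, threshold=0.75):
    """Decide whether the agent may answer from retrieved context."""
    if not retrieval_scores or max(retrieval_scores) < threshold:
        return ESCALATE
    return ANSWER
```

Note the empty-results case routes to a human too: no retrieved context means no grounded answer is possible.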

Problem 3: Interruption Handling

Real conversations are not tidy turn-taking exchanges. Users interrupt, change direction mid-sentence, and talk over the agent. A voice AI that cannot handle interruptions feels robotic and creates frustration.

The technical challenge is barge-in: detecting that the user has started speaking while the agent is still talking, stopping the agent's audio, and processing the new input without losing conversation state.

Implementation approach

  • Voice Activity Detection (VAD): Run continuous VAD on the user's audio channel even while the agent is speaking. Silero VAD runs locally with sub-10ms detection latency.
  • Graceful audio stop: When barge-in is detected, stop the TTS stream cleanly (mid-sentence is fine -- users do not mind) and reset the pipeline to STT mode.
  • Context preservation: Store the interrupted agent utterance in the conversation history as "[interrupted]" so the LLM knows the prior response was not completed. This prevents the agent from re-attempting the same response verbatim.
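Putting the three steps together, a minimal sketch of the barge-in bookkeeping might look like this. The `tts_stop` callback is a hypothetical stand-in for whatever cancel call your TTS client exposes, and the `[interrupted]` marker follows the context-preservation rule above.

```python
# Sketch of barge-in bookkeeping: stop agent audio on VAD trigger and
# record the truncated utterance so the LLM knows it never finished.
# `tts_stop` is a hypothetical stand-in for a TTS client's cancel call.

def handle_barge_in(history, spoken_so_far, tts_stop):
    """User started talking over the agent: stop audio, mark history."""
    tts_stop()  # cut the audio stream; stopping mid-sentence is fine
    history.append({
        "role": "assistant",
        "content": spoken_so_far + " [interrupted]",
    })
    return history  # pipeline then resets to STT mode for the new input

# Usage:
history = [{"role": "user", "content": "What are your hours?"}]
stopped = []
handle_barge_in(history, "We are open Monday through", lambda: stopped.append(True))
```

Because the partial utterance is stored with the marker, the next LLM call sees that its prior response was cut off and will not replay it verbatim.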

The Stack We Ship With

  • STT: Deepgram Nova-2 (streaming, best accuracy/latency balance)
  • LLM: Claude Haiku (simple turns), Claude Sonnet (reasoning turns)
  • TTS: ElevenLabs (highest naturalness), Cartesia (lowest latency)
  • VAD: Silero VAD via Python or JS bindings
  • Orchestration: LiveKit Agents framework (handles WebRTC, VAD pipeline, barge-in)
  • Infrastructure: Railway or Fly.io with region-local deployment for latency

The Honest Reality

Voice AI is not a feature you bolt on in a sprint. The latency work alone requires careful measurement and iteration. But the use cases where it lands -- appointment booking, support triage, outbound qualification -- deliver ROI that text-based chatbots cannot match, because users actually use them.

The bar for "good enough" in voice is higher than text. Clear context, fast response, and graceful handling of real conversational dynamics are not optional. Build to that bar or do not build it at all.