Voice and Real-Time Conversational Agents

Typing is slow. Talking is fast. When an agent can listen and speak in real time, the interaction shifts from asynchronous chat to something that feels like a phone call — immediate, fluid, and unforgiving of latency. Voice agents handle customer support calls, drive in-car assistants, power accessibility interfaces, and enable hands-free workflows.

But the engineering challenges differ from those of text-based agents. You are no longer optimizing for tokens per second; you are optimizing for milliseconds of silence. A 500ms pause in a text chat is invisible. A 500ms pause in a voice conversation feels broken.

The Voice Agent Pipeline

A voice agent wraps a language model in an audio loop. The user speaks, the system transcribes, the model reasons, and the system speaks the response back. Each stage introduces latency, and those latencies stack.

Microphone → STT → Agent (LLM) → TTS → Speaker
   │                                        │
   └────────── ~200-800ms total ────────────┘

The pipeline has three core components:

  • Speech-to-text (STT): converts the audio stream into text the model can process. Also called automatic speech recognition (ASR).
  • Agent logic: the language model processes the transcribed input, reasons, optionally calls tools, and generates a text response.
  • Text-to-speech (TTS): converts the model's text output into audio that sounds natural.

Each component runs as a streaming system. You do not wait for the user to finish speaking before starting transcription. You do not wait for the full model response before starting synthesis. Everything overlaps.

A naive implementation that waits for each stage to complete sequentially might take 2–3 seconds end-to-end. A well-engineered streaming pipeline brings the gap between the detected end of the user's turn and the first audio output down to roughly 200–500ms, fast enough to feel conversational.

Streaming and Turn-Taking

Human conversation follows turn-taking rules. People signal when they are done speaking — through intonation drops, completed sentences, or brief pauses. A voice agent must detect these signals to know when to start responding. Get it wrong, and you either interrupt the user mid-sentence or leave an awkward silence.

Endpointing is the process of detecting when the user has stopped speaking. Simple approaches use a silence threshold: if no speech is detected for 500–700ms, assume the turn is complete. This works for simple queries but fails with hesitations, mid-thought pauses, and list recitation. More sophisticated endpointing uses a model that considers prosody (pitch, rhythm) and linguistic completeness to predict turn boundaries.
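The silence-threshold approach can be sketched in a few lines. This is a minimal illustration, not a production endpointer: it assumes some upstream voice activity detector already labels each audio frame as speech or silence, and the 20ms frame size and 600ms threshold are illustrative choices within the range mentioned above.

```python
from dataclasses import dataclass


@dataclass
class SilenceEndpointer:
    """Declares end-of-turn once enough trailing silence follows speech."""

    silence_threshold_ms: int = 600  # assumed value inside the 500-700ms range
    frame_ms: int = 20               # assumed duration of one audio frame
    _heard_speech: bool = False
    _silence_ms: int = 0

    def push(self, is_speech: bool) -> bool:
        """Feed one frame's speech/silence flag; return True once the turn ended."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0      # any speech resets the silence timer
            return False
        if not self._heard_speech:
            return False              # leading silence is not a turn
        self._silence_ms += self.frame_ms
        return self._silence_ms >= self.silence_threshold_ms


def frames_until_endpoint(frames):
    """Return the index of the frame at which the endpoint fires, or None."""
    endpointer = SilenceEndpointer()
    for index, is_speech in enumerate(frames):
        if endpointer.push(is_speech):
            return index
    return None


# 200ms of leading silence, 1s of speech, then silence.
print(frames_until_endpoint([False] * 10 + [True] * 50 + [False] * 40))  # 89
```

A 400ms mid-thought pause does not trigger this endpointer, but a 600ms hesitation would, which is exactly the failure that prosody-aware models try to avoid.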

Once the agent decides to respond, it must begin generating audio immediately. This requires streaming token generation: as the language model emits tokens, the TTS system synthesizes them incrementally. The first chunk of audio starts playing while the model is still generating the rest of the response.
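As a rough sketch of that overlap, the snippet below connects a token producer and a synthesis consumer through an `asyncio.Queue`. Both coroutines are hypothetical stand-ins (no real LLM or TTS service is called); the point is only that synthesis of the first token begins before generation has finished.

```python
import asyncio


async def llm_tokens(queue, events):
    """Stand-in for a streaming LLM: emits one token every 10 ms."""
    for token in ["Sure,", "I", "can", "help."]:
        events.append(f"llm:{token}")
        await queue.put(token)
        await asyncio.sleep(0.01)
    await queue.put(None)  # sentinel: generation finished


async def tts_consumer(queue, events):
    """Stand-in for streaming TTS: handles each token as soon as it arrives."""
    while (token := await queue.get()) is not None:
        events.append(f"tts:{token}")


async def run_pipeline():
    queue = asyncio.Queue()
    events = []
    # Producer and consumer run concurrently rather than one after the other.
    await asyncio.gather(llm_tokens(queue, events), tts_consumer(queue, events))
    return events


events = asyncio.run(run_pipeline())
print(events)
```

In the resulting event log, the first `tts:` entry appears well before the last `llm:` entry, mirroring the staggered bars in the timeline that follows.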

Time ───────────────────────────────────────────────────────►

User speaks:    [────────────────]
STT streaming:   [─────────────────]
Endpoint detected:                    │
LLM generates:                        [───────────────────]
TTS streams:                            [─────────────────────]
User hears:                               [─────────────────────]

Barge-in is another critical capability: the user interrupts the agent mid-response. A robust voice agent detects incoming speech, stops playback immediately, cancels the in-flight generation, and processes the new input. Without barge-in, users feel trapped listening to irrelevant responses they cannot cut short.
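One way to structure barge-in, assuming an asyncio-based agent, is to race playback against a "user started speaking" signal and cancel whichever loses. The names `play_response` and `speak_with_barge_in` are hypothetical, and playback is simulated with sleeps; a real agent would also cancel the in-flight LLM request at the same point.

```python
import asyncio


async def play_response(chunks, played):
    """Stand-in for audio playback: each chunk takes 50 ms to play."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)


async def speak_with_barge_in(chunks, user_speech):
    """Play chunks until finished or until user_speech is set, whichever is first."""
    played = []
    playback = asyncio.create_task(play_response(chunks, played))
    barge_in = asyncio.create_task(user_speech.wait())
    done, pending = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stops playback on barge-in, or the watcher on completion
    await asyncio.gather(*pending, return_exceptions=True)
    return played, barge_in in done


async def demo():
    user_speech = asyncio.Event()
    chunks = [f"chunk-{i}" for i in range(10)]  # 500 ms of audio in total
    # Simulate the user starting to talk 120 ms into the response.
    asyncio.get_running_loop().call_later(0.12, user_speech.set)
    return await speak_with_barge_in(chunks, user_speech)


played, interrupted = asyncio.run(demo())
print(len(played), interrupted)
```

Only the first few chunks are played before the interruption, and the remaining audio is never sent to the speaker.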

Latency Budgets

In voice, latency is the primary quality metric. Research on conversational dynamics puts the threshold for "natural" response time at roughly 200–300ms — the gap between human speaker turns. Anything above 800ms feels noticeably slow. Above 1.5 seconds, users assume the system is broken.

A latency budget allocates time across the pipeline:

Stage                      Target       Notes
Endpointing                200–500ms    Silence threshold plus processing
STT finalization           50–150ms     Final transcript after endpoint
LLM time-to-first-token    100–300ms    Model inference startup
TTS time-to-first-audio    50–150ms     Synthesize first chunk
Network round-trips        20–100ms     Per hop
Total first-audio          400–800ms    End-to-end target
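A quick way to sanity-check a budget like this is to add up the stage ranges. The sketch below does that for the figures in the table, assuming a single network hop. It shows that the best cases sum to about 420ms, while the worst cases sum to 1200ms, so meeting the 800ms end of the target means most stages cannot sit at their upper bound at the same time.

```python
# Per-stage targets in milliseconds, copied from the table above.
BUDGET_MS = {
    "endpointing": (200, 500),
    "stt_finalization": (50, 150),
    "llm_ttft": (100, 300),
    "tts_first_audio": (50, 150),
    "network_per_hop": (20, 100),
}
TARGET_MS = (400, 800)


def total_range(budget, network_hops=1):
    """Sum best-case and worst-case stage latencies for a given hop count."""
    low = high = 0
    for stage, (stage_low, stage_high) in budget.items():
        factor = network_hops if stage == "network_per_hop" else 1
        low += stage_low * factor
        high += stage_high * factor
    return low, high


best, worst = total_range(BUDGET_MS, network_hops=1)
print(best, worst)             # 420 1200
print(worst <= TARGET_MS[1])   # False
```

Raising `network_hops` to model a multi-service cloud pipeline widens the gap further, which is one reason the optimizations below focus on removing hops and shrinking endpointing time.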

Every millisecond matters. Optimizations include:

  • Speculative generation: start generating a response before endpointing is confirmed, based on partial transcripts. If the user keeps talking, discard the speculative output.
  • Chunked TTS: synthesize audio in sentence-sized chunks rather than waiting for the full response. Play the first chunk while generating the rest.
  • Edge inference: run STT and TTS on-device or at the network edge to eliminate round-trip latency to a distant data center.
  • Model selection: use smaller, faster models for voice interactions. A 7B-parameter model with a 50ms time-to-first-token (TTFT) can feel better in conversation than a 70B model with a 300ms TTFT, even if the larger model produces marginally better text.
  • Connection persistence: keep WebSocket connections open to avoid repeating TCP and TLS handshakes on every turn.
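Speculative generation, the first item in the list above, might look like the following under asyncio. This is a sketch under stated assumptions: `generate_reply` is a hypothetical stand-in for an LLM call, and the final transcript is modeled as an awaitable that resolves once endpointing is confirmed.

```python
import asyncio


async def generate_reply(transcript):
    """Hypothetical stand-in for an LLM call."""
    await asyncio.sleep(0.02)
    return f"reply to: {transcript}"


async def respond_speculatively(partial, final_transcript):
    """Start generating on the partial transcript; keep it only if it held up."""
    speculative = asyncio.create_task(generate_reply(partial))
    final = await final_transcript
    if final == partial:
        return await speculative, True       # speculation paid off
    speculative.cancel()                     # user kept talking: discard
    await asyncio.gather(speculative, return_exceptions=True)
    return await generate_reply(final), False


async def confirmed(text, delay=0.01):
    """Simulates endpointing being confirmed after a short delay."""
    await asyncio.sleep(delay)
    return text


async def demo():
    hit = await respond_speculatively(
        "what's my balance", confirmed("what's my balance")
    )
    miss = await respond_speculatively(
        "what's my balance", confirmed("what's my balance on the savings account")
    )
    return hit, miss


hit, miss = asyncio.run(demo())
print(hit)
print(miss)
```

When the speculation holds, the reply is already partly generated by the time the endpoint is confirmed. When it does not, the cost is one wasted model call, so this trade only makes sense where compute is cheaper than silence.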

Speech-to-Text in the Agent Loop

STT is not a solved problem — it introduces errors that propagate downstream. Misheard words become wrong queries, wrong tool calls, wrong answers. The agent must be resilient to transcription noise.

Key design choices for STT integration:

Streaming vs. batch: streaming STT emits partial transcripts as the user speaks, enabling faster downstream processing. The partial transcripts are unstable — words change as more audio arrives. The system must handle revisions gracefully: display interim results to show responsiveness, but only act on final transcripts.
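A small buffer that separates interim text from committed text is one way to implement this rule. The event shape used here (a text string plus an `is_final` flag) is an assumption; real STT services name and structure these fields differently.

```python
class TranscriptBuffer:
    """Keeps unstable interim text separate from committed final text."""

    def __init__(self):
        self.committed = []   # finalized segments, safe to act on
        self.interim = ""     # latest partial hypothesis, display only

    def on_event(self, text, is_final):
        """Apply one STT event; return final text to act on, else None."""
        if is_final:
            self.committed.append(text)
            self.interim = ""
            return text
        self.interim = text   # overwrite: partials revise earlier partials
        return None

    def display(self):
        """What a UI could show: committed text plus the current partial."""
        tail = [self.interim] if self.interim else []
        return " ".join(self.committed + tail)


buffer = TranscriptBuffer()
buffer.on_event("book a fight", is_final=False)        # misheard partial
buffer.on_event("book a flight to", is_final=False)    # revised partial
action = buffer.on_event("book a flight to Boston", is_final=True)
print(action)  # book a flight to Boston
```

The misheard "fight" is shown briefly but never reaches the agent, because only the final event returns text to act on.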

Language and accent handling: STT accuracy varies significantly by accent, dialect, background noise, and domain vocabulary. Medical terms, proper nouns, and technical jargon are common failure points. Fine-tuning the STT model on domain-specific audio (call center recordings, medical dictation) can substantially improve accuracy on that vocabulary.

Confidence and fallback: when STT confidence is low, the agent should ask for clarification rather than proceed with a likely-wrong transcription. "I didn't catch that — could you repeat?" is better than a confidently wrong answer.
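A confidence gate can be as simple as the function below. It assumes the STT system reports a confidence score between 0 and 1, and the 0.6 threshold is a placeholder that would need tuning against real traffic.

```python
CLARIFY_PROMPT = "I didn't catch that. Could you repeat?"


def route_transcript(text, confidence, threshold=0.6):
    """Return ("clarify", prompt) for shaky transcripts, else ("proceed", text)."""
    if not text.strip() or confidence < threshold:
        return ("clarify", CLARIFY_PROMPT)
    return ("proceed", text)


print(route_transcript("cancel my order", 0.93))
print(route_transcript("cancel my quarter", 0.41))
```

Empty transcripts are treated the same as low-confidence ones, since there is nothing useful to pass to the model in either case.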

Multi-language: voice agents in global products must handle language detection and switching. A user might start in English and switch to Spanish mid-sentence. The STT system needs to either detect the language shift or operate in a multilingual mode.

Text-to-Speech and Voice Design

TTS transforms the agent from a text box into a persona. The voice carries tone, pacing, emphasis, and emotional register. Modern neural TTS systems produce remarkably natural speech, but integrating them into a real-time agent introduces constraints.

Voice selection: the choice of voice (pitch, timbre, speaking rate) shapes user perception. A customer service agent might use a warm, mid-paced voice. A navigation assistant uses a crisp, efficient delivery. Most TTS providers offer pre-built voices and support voice cloning for custom brand voices.

Prosody control: flat reading of text sounds robotic. Good TTS adds appropriate pauses at commas, emphasis on key words, and rising intonation for questions. Some systems accept SSML (Speech Synthesis Markup Language) annotations for explicit prosody control; others infer it from context.
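For systems that do accept SSML, the markup can be generated from plain sentences. The helper below is a hypothetical sketch that uses only standard SSML elements (`speak`, `s`, `break`, `emphasis`); which elements and attribute values a given TTS provider honors varies, so treat the output as something to check against that provider's documentation.

```python
from xml.sax.saxutils import escape


def to_ssml(sentences, emphasize=(), pause_ms=300):
    """Wrap sentences in SSML with a pause between them and optional emphasis."""
    parts = []
    for sentence in sentences:
        text = escape(sentence)  # escape &, <, > so the markup stays valid
        for word in emphasize:
            safe = escape(word)
            text = text.replace(safe, f'<emphasis level="strong">{safe}</emphasis>')
        parts.append(f"<s>{text}</s>")
    pause = f'<break time="{pause_ms}ms"/>'
    return "<speak>" + pause.join(parts) + "</speak>"


print(to_ssml(["Your order ships Friday.", "Anything else?"], emphasize=["Friday"]))
```

This places a 300ms break between the two sentences and marks "Friday" for strong emphasis, which is the kind of explicit control the paragraph above describes.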

Streaming synthesis: to minimize latency, the TTS system must accept text incrementally and begin producing audio before the full response is available. This means the system cannot optimize prosody across the entire response — it must make local decisions about pacing and intonation. Sentence boundaries are natural breakpoints for chunk-based synthesis.
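Splitting a token stream at sentence boundaries can be sketched with a generator. The boundary rule here (sentence-ending punctuation followed by whitespace) is deliberately naive and is an assumption for illustration; it would split incorrectly after abbreviations such as "Dr." and needs refinement for real text.

```python
import re

# A chunk ends at ., ! or ? followed by whitespace (a deliberately naive rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(token_stream):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when generation ends


tokens = ["Your", " order", " shipped.", " It", " arrives", " Friday.",
          " Anything", " else?"]
for chunk in sentence_chunks(tokens):
    print(chunk)  # each chunk could be handed to TTS immediately
```

The first sentence is yielded as soon as the token after it arrives, so synthesis of "Your order shipped." can start while the model is still producing the rest.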

Caching: common phrases ("I can help with that," "Let me look that up") can be pre-synthesized and cached. This eliminates TTS latency for frequent responses entirely.
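A minimal version of this cache can be built with `functools.lru_cache` from the Python standard library. The `synthesize` function is a hypothetical stand-in that returns fake audio bytes; in practice it would call the TTS engine, and the cache might live on disk or in a shared store rather than in process memory.

```python
from functools import lru_cache

synth_calls = []  # records which phrases actually reached the synthesizer


def synthesize(text):
    """Hypothetical stand-in for a real TTS call; returns fake audio bytes."""
    synth_calls.append(text)
    return text.encode("utf-8")


@lru_cache(maxsize=256)
def cached_audio(text):
    """Synthesize once per distinct phrase, then serve from memory."""
    return synthesize(text)


# Warm the cache at startup with the stock phrases from the text.
for phrase in ("I can help with that.", "Let me look that up."):
    cached_audio(phrase)

audio = cached_audio("I can help with that.")  # cache hit: no new synthesis call
print(len(synth_calls), cached_audio.cache_info().hits)
```

The cache key is the exact text, so even small wording differences miss the cache. Keeping stock phrases in a fixed list, as done here, avoids that.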

Architecture Patterns

Voice agents follow several architectural patterns depending on latency requirements and complexity:

Cloud-mediated pipeline: STT, LLM, and TTS all run as cloud services connected by a real-time media server. The media server handles WebSocket connections to the client, streams audio to STT, routes text to the LLM, and streams TTS audio back. This is the most flexible architecture — you can swap models, add tools, and scale independently — but it accumulates network latency across three service calls.

Speech-to-speech models: newer architectures bypass the STT → text → TTS pipeline entirely. A single multimodal model takes audio in and produces audio out directly. This removes the separate transcription and synthesis stages, which can cut latency substantially and preserves paralinguistic information (tone, emotion, hesitation). The trade-off is less interpretability, since there is no intermediate text to log, filter, or moderate, and fewer tool-calling capabilities in current implementations.

Hybrid edge-cloud: run STT and TTS on-device (phone, smart speaker, car) while the LLM runs in the cloud. This eliminates two network hops and puts the latency-sensitive audio processing close to the microphone and speaker. The LLM call is the only network round-trip.

┌────────────────────────────────────────────────┐
│                 Client Device                  │
│                                                │
│  Mic → [STT] ───┐        ┌──→ [TTS] → Speaker  │
│                 │        │                     │
└─────────────────┼────────┼─────────────────────┘
                  │        │
            text  │        │  text
                  ▼        │
         ┌─────────────────────┐
         │  Cloud LLM + Tools  │
         └─────────────────────┘

Conclusion

Voice agents transform the agent loop from a text exchange into a real-time audio conversation. The core challenges are latency management (budgeting milliseconds across STT, LLM, and TTS), turn-taking detection (knowing when the user is done speaking), streaming at every stage (overlapping transcription, generation, and synthesis), and graceful handling of speech recognition errors. The most effective architectures stream aggressively, use chunked synthesis to minimize time-to-first-audio, and treat latency as the primary optimization target — because in voice, silence is the failure mode.