The Real-Time Pipeline: How Audio Becomes an Agent Response
From microphone to speaker in under 500ms. A simplified walk-through of the voice agent pipeline — what happens at each stage and where latency hides.
When a caller speaks to a voice agent, their words travel through a pipeline of transformations before they hear a response. Understanding this pipeline is essential for diagnosing latency, choosing the right models, and knowing where to optimize.
The five stages
- Audio capture and transport — the caller's voice is captured by their microphone, encoded, and transmitted to the agent server. Budget: 50–150ms over PSTN (phone), 20–50ms over WebRTC (browser).
- Speech recognition (ASR) — the audio stream is transcribed to text. Streaming ASR begins transcribing as the caller speaks, producing partial results. Final transcription happens at utterance end. Budget: 100–300ms.
- Language model reasoning (LLM) — the transcribed text plus conversation context is sent to the LLM for response generation. Streaming inference emits tokens as they are generated rather than waiting for the complete response. Budget: 200–600ms for first token.
- Speech synthesis (TTS) — the LLM's text output is converted to audio. Streaming TTS begins generating audio from the first tokens without waiting for the complete response. Budget: 100–200ms for first audio chunk.
- Audio delivery — the synthesized audio is streamed back to the caller's device, incurring the same transport latency as the capture stage. Summing the stage budgets shows where the time goes; see the sketch after this list.
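Here is a back-of-the-envelope sketch that sums the per-stage budgets above. The `Stage` dataclass and the stage names are purely illustrative, not from any SDK; because each budget already reflects streaming output (final transcript, first token, first audio chunk), the sum approximates time to first audio rather than a non-streaming total.

```python
# Back-of-the-envelope latency budget for the five stages above.
# Stage is a hypothetical helper type, not a real library API.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    min_ms: int  # best-case added latency
    max_ms: int  # worst-case added latency

PIPELINE = [
    Stage("transport in (WebRTC best, PSTN worst)", 20, 150),
    Stage("ASR final transcript", 100, 300),
    Stage("LLM first token", 200, 600),
    Stage("TTS first audio chunk", 100, 200),
    Stage("transport out", 20, 150),
]

for s in PIPELINE:
    print(f"{s.name:<44}{s.min_ms:>4}-{s.max_ms}ms")

best = sum(s.min_ms for s in PIPELINE)
worst = sum(s.max_ms for s in PIPELINE)
print(f"{'approximate time to first audio':<44}{best:>4}-{worst}ms")
```

On these numbers, the LLM's time to first token is the largest single line item, which is why model choice and prompt size are usually the first optimization targets.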
Streaming is the key optimization
Without streaming, each stage waits for the previous one to finish completely. Total latency: 1.5–3 seconds. Intolerable. With streaming, each stage processes incrementally: ASR sends partial transcriptions to the LLM, the LLM streams tokens to TTS, and TTS streams audio to the caller. Effective latency drops to 400–700ms, fast enough that the conversation feels natural. This is why platforms purpose-built for real-time audio exist: the streaming orchestration is the hard part. The sketch below shows the hand-off pattern.
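To make the overlap concrete, here is a minimal asyncio sketch of the hand-off. All three stage functions are hypothetical stubs with simulated delays; a real agent would call ASR, LLM, and TTS provider SDKs instead. For simplicity, the LLM stub waits for the final transcript before generating, while the LLM-to-TTS and TTS-to-caller hand-offs stream token by token.

```python
# Minimal streaming hand-off sketch: each stage is an async generator that
# yields incremental output, so downstream stages start before upstream ends.
# All delays are simulated; every function here is a hypothetical stub.
import asyncio
from typing import AsyncIterator

async def asr(frames: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Yield partial transcripts as audio frames arrive (stub)."""
    async for frame in frames:
        await asyncio.sleep(0.05)        # simulated per-frame recognition delay
        yield frame.decode()             # pretend each frame transcribes to a word

async def llm(transcripts: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stream response tokens once the utterance is complete (stub)."""
    words = [t async for t in transcripts]  # utterance end: final transcript
    await asyncio.sleep(0.2)                # simulated time to first token
    for token in ("You", "said:", *words):
        yield token
        await asyncio.sleep(0.02)           # simulated inter-token gap

async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Emit audio chunks from the first tokens, not the full reply (stub)."""
    async for token in tokens:
        await asyncio.sleep(0.03)           # simulated synthesis delay
        yield token.encode()                # pretend this is an audio chunk

async def main() -> None:
    async def mic() -> AsyncIterator[bytes]:
        for word in (b"book", b"a", b"table"):
            yield word                      # simulated caller audio frames

    start = asyncio.get_running_loop().time()
    async for chunk in tts(llm(asr(mic()))):   # the three stages overlap
        elapsed = (asyncio.get_running_loop().time() - start) * 1000
        print(f"{elapsed:5.0f}ms  caller hears {chunk!r}")

asyncio.run(main())
```

Run it and the first audio chunk prints around 380ms in, well before the last chunk has been synthesized; that gap between first audio and complete audio is exactly what streaming buys.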
Ready to build?
See how Mazed's multimodal AI agents work for your use case.