VoxEngine supports a wide variety of Voice AI architectures and components. The right pipeline depends on your priorities: latency, interactivity, voice choice, and low-level control are among the factors to consider.
User↔LLM
Fully integrated Voice AI that often offers the lowest overall latency.
Fastest path | Lowest latency
LLM↔TTS
Realtime LLM for speech input and reasoning, with a separate TTS system for output voice.
More voice choice | Still realtime
STT↔LLM↔TTS
Separate STT, LLM, and TTS stages for maximum provider flexibility and pipeline control.
Most configurable | Most complex
Some vendors present a speech-to-speech (S2S) API externally but internally manage a cascaded architecture of separate components, with varying degrees of control and configuration. See more details below.
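One way to read the three options above is as a decision that starts with the simplest pipeline and adds stages only when a requirement calls for them. The TypeScript sketch below encodes that reading. The type names, the `Priorities` fields, and the order of the checks are illustrative assumptions, not VoxEngine APIs or an official selection rule.

```typescript
// Hypothetical types for illustration; none of these names come from VoxEngine.
type Architecture = "speech-to-speech" | "hybrid-cascade" | "full-cascade";

interface Priorities {
  // True when you need to mix STT, LLM, and TTS providers or control each stage.
  needsProviderFlexibility: boolean;
  // True when you need voices beyond what the realtime LLM offers.
  needsVoiceChoice: boolean;
}

// Start from the lowest-latency option and move to a more complex
// pipeline only when a stated requirement demands it.
function chooseArchitecture(p: Priorities): Architecture {
  if (p.needsProviderFlexibility) return "full-cascade"; // STT↔LLM↔TTS
  if (p.needsVoiceChoice) return "hybrid-cascade";       // LLM↔TTS
  return "speech-to-speech";                             // User↔LLM
}
```

In practice the decision also depends on factors this sketch leaves out, such as vendor features and cost, so treat it as a starting point.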
Voximplant always manages the telephony side of the call: routing, answering, bridging media, call control, and any orchestration logic you add in VoxEngine. Our Voice AI API clients, also referred to as connectors, connect VoxEngine directly to the provider’s environment over a low-latency WebSocket. The connectors differ in how the provider handles transcription, reasoning, and speech synthesis internally.
The connectors for OpenAI, Gemini, and Grok provide direct access to the speech-enabled LLM. Speech input and output parameters are managed as part of the connector.
Ultravox provides a speech-native LLM but integrates TTS within its own environment, exposing only a speech-to-speech interface to Voximplant. Similarly, ElevenLabs, Cartesia, and Deepgram provide a speech-to-speech interface to Voximplant but handle the LLM, STT, and TTS elements internally. Configuration of these internal elements varies by vendor. We refer to these integrations as “indirect speech-to-speech”.
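The direct versus indirect grouping described above can be captured as a small lookup. This is a sketch for illustration only: the `ConnectorKind` type, the lowercase vendor keys, and the `connectorKind` helper are hypothetical names, and the grouping simply restates the preceding paragraphs.

```typescript
// Hypothetical classification helper; not part of any Voximplant API.
type ConnectorKind = "direct" | "indirect";

const CONNECTOR_KINDS = new Map<string, ConnectorKind>([
  // Direct access to the speech-enabled LLM.
  ["openai", "direct"],
  ["gemini", "direct"],
  ["grok", "direct"],
  // Speech-to-speech interface with more of the stack managed by the provider.
  ["ultravox", "indirect"],
  ["elevenlabs", "indirect"],
  ["cartesia", "indirect"],
  ["deepgram", "indirect"],
]);

// Returns undefined for vendors that the text above does not classify.
function connectorKind(vendor: string): ConnectorKind | undefined {
  return CONNECTOR_KINDS.get(vendor.toLowerCase());
}
```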
All vendors provide some level of end-of-turn detection, with some vendors providing more explicit control over turn-taking and barge-in.
The table below summarizes support by vendor from the Voximplant connector perspective.
In this pattern, caller audio is bridged to the provider in realtime, and the provider returns speech audio directly back into the call.


These connectors expose the speech-enabled LLM directly to VoxEngine.
Direct realtime speech-to-speech with OpenAI.
Direct live audio pipeline with Gemini Live.
Realtime voice interaction with Grok voice agent sessions.
These connectors present a speech-to-speech interface to VoxEngine while managing more of the internal speech and agent stack inside the provider environment.
Speech-native realtime agent with native speech input and output.
Conversational agent sessions with realtime speech handling.
Realtime voice agent flow with Deepgram speech input and output.
Speech-to-speech agent flow with Cartesia-managed runtime behavior.
Hybrid cascade, sometimes called half-cascade, keeps the realtime LLM in the loop for user speech input and reasoning but moves speech output to a separate synthesis provider. This is often the best compromise between latency and voice flexibility.
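The shape of a hybrid cascade can be sketched as a realtime LLM that takes caller audio and returns text, composed with an interchangeable synthesizer. Everything in this TypeScript sketch is a simplifying assumption: the type names are hypothetical, audio is modeled as a plain number array, and the stages are synchronous stubs, whereas real providers stream asynchronously.

```typescript
// Hypothetical stand-ins for illustration; real stages stream audio and text.
type PcmAudio = number[];
type RealtimeLlm = (callerAudio: PcmAudio) => string; // speech in, text out
type Synthesizer = (replyText: string) => PcmAudio;   // text in, speech out

// Compose the two stages. The LLM stays fixed while the voice can be swapped.
function makeHybridAgent(llm: RealtimeLlm, tts: Synthesizer) {
  return (callerAudio: PcmAudio): PcmAudio => tts(llm(callerAudio));
}

// Stub stages that make the data flow visible.
const stubLlm: RealtimeLlm = (audio) => `reply to ${audio.length} frames`;
const voiceA: Synthesizer = (text) => text.split("").map((ch) => ch.charCodeAt(0));
const voiceB: Synthesizer = (text) => text.split("").map((ch) => ch.charCodeAt(0) + 1);

// Same LLM, two different output voices.
const agentA = makeHybridAgent(stubLlm, voiceA);
const agentB = makeHybridAgent(stubLlm, voiceB);
```

Because only the `Synthesizer` changes between `agentA` and `agentB`, the reasoning stage is untouched when the voice provider is replaced.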


Realtime LLM with Cartesia realtime TTS.
Realtime LLM with ElevenLabs streaming TTS.
Realtime LLM with Inworld speech output.
Full cascade separates speech-to-text, reasoning, and text-to-speech into independent stages. This gives you the most control over the pipeline and the most freedom to mix providers.
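As a rough illustration of the three independent stages, the TypeScript sketch below models each stage as a separate function and runs a single conversational turn through them. The type names, the `runTurn` helper, and the stub stages are all hypothetical. Real STT, LLM, and TTS stages would be asynchronous, streaming, and driven by turn-taking logic, which this sketch omits.

```typescript
// Hypothetical stage signatures; audio is modeled as a plain number array.
type AudioFrames = number[];

interface CascadePipeline {
  stt: (callerAudio: AudioFrames) => string; // speech-to-text
  llm: (userText: string) => string;         // reasoning
  tts: (replyText: string) => AudioFrames;   // text-to-speech
}

// One turn: transcribe the caller, generate a reply, synthesize the reply.
function runTurn(pipeline: CascadePipeline, callerAudio: AudioFrames): AudioFrames {
  const transcript = pipeline.stt(callerAudio);
  const reply = pipeline.llm(transcript);
  return pipeline.tts(reply);
}

// Stub stages; each one could be replaced without touching the other two.
const fakePipeline: CascadePipeline = {
  stt: (audio) => `heard ${audio.length} frames`,
  llm: (text) => `Reply to: ${text}`,
  tts: (text) => text.split("").map((ch) => ch.charCodeAt(0)),
};

const replyAudio = runTurn(fakePipeline, [0, 0, 0]);
```

Keeping each stage behind its own function boundary is what allows providers to be mixed freely, at the cost of owning the orchestration between them.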


For turn-taking, interruption, and end-of-turn behavior, see the Speech Flow Control guides.
Independent STT, LLM, and TTS with explicit turn-taking.
Reference guide for VAD, turn detection, and barge-in orchestration.
Related guide for playback and TTS orchestration patterns.
If you are new to Voice AI starting from scratch, we recommend the following progression:
That progression keeps the first version simple while leaving room to add flexibility only when the product requirements justify the extra moving parts.