Features Overview

Voice AI Orchestration on a Communications Platform

Detailed capabilities

Voice AI Orchestration

Voximplant AI is a serverless runtime for Voice AI pipelines. It connects real-time agent/LLM systems and speech engines to PSTN, SIP, WebRTC, mobile, and WhatsApp calling, with code-driven orchestration and provider flexibility. See the Voximplant AI page and the Voice AI connectors section of the docs.

Supported vendors (direct agent / real-time LLM connectors)

Native (direct) connectors are offered for:

OpenAI (Realtime / agent-style integrations) - Docs: OpenAI
Google Gemini (Live) - Docs: Google
Deepgram Voice Agent - Docs: Deepgram
ElevenLabs Agents / Conversational AI - Docs: ElevenLabs
Ultravox (WebSocket API) - Docs: Ultravox
Cartesia Line Agents - Docs: Cartesia Line Agents
xAI (Grok Voice Agent) - Docs: xAI

Beyond the vendors above, Voximplant AI can also connect to any other WebSocket interface, which covers other real-time AI systems.

Supported vendors (speech engines: STT / TTS)

Voximplant’s platform speech layer (STT/TTS) includes built-in providers such as:

Speech-to-Text (STT): Google Speech Cloud, Microsoft Azure STT, Amazon Transcribe, Yandex Speech Cloud
Text-to-Speech (TTS): Google Speech Cloud, Amazon Polly, Yandex Speech Cloud, Microsoft Azure TTS, Tinkoff VoiceKit

For realtime / streaming TTS used in Voice AI scenarios, Voximplant also provides native VoxEngine modules and guides for:

Cartesia Realtime TTS - Guide: Realtime TTS and API refs
Inworld Realtime TTS - Guide: Realtime TTS
ElevenLabs Streaming / realtime TTS - Guide: ElevenLabs TTS and API refs

Pipeline options (architectures you can run)

Speech-to-speech: real-time audio in and real-time audio out (agent API handles full duplex loop)
Speech -> LLM -> TTS: stream audio directly into a speech LLM and use a different TTS for output
STT -> LLM -> TTS: stream audio to STT, pass text to an LLM/toolchain, synthesize response audio
Hybrid: combine a real-time agent API for turn-taking with separate best-of-breed STT/TTS components (mix and match)
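To make the cascaded STT -> LLM -> TTS option concrete, here is a minimal sketch in plain JavaScript. It is not VoxEngine code: makePipeline and the fake* stages are hypothetical names, and each stage stands in for whatever provider call you actually use. The point is only that each stage is an independent async step.

```javascript
// Hypothetical sketch: one conversational turn built from pluggable stages.
// None of these names come from VoxEngine; each stage stands in for a provider call.
function makePipeline({ stt, llm, tts }) {
  // Every stage is an async function, so any vendor client can be dropped in.
  return async function handleTurn(audioIn) {
    const userText = await stt(audioIn);   // speech -> text
    const replyText = await llm(userText); // text -> text
    return tts(replyText);                 // text -> speech
  };
}

// Stub stages that only tag their input, so the data flow is visible.
const fakeStt = async (audio) => `text(${audio})`;
const fakeLlm = async (text) => `reply(${text})`;
const fakeTts = async (text) => `audio(${text})`;
const otherTts = async (text) => `audio2(${text})`;

const pipeline = makePipeline({ stt: fakeStt, llm: fakeLlm, tts: fakeTts });
// Swapping the TTS vendor changes one argument; the rest of the turn is untouched.
const swapped = makePipeline({ stt: fakeStt, llm: fakeLlm, tts: otherTts });
```

In a speech-to-speech setup the three stages collapse into a single agent connection, so a wrapper like this is only relevant to the cascaded and hybrid options.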

Orchestration primitives (what you control)

Mix and match providers: swap STT/TTS/LLM vendors without changing your telephony integration
Parallel model execution: run multiple speech/LLM components in parallel when useful (for example, intent extraction + generation)
Failover paths: fall back to alternate speech/LLM providers when a step errors or times out
Wideband audio: higher fidelity audio path for improved user experience and model comprehension
Deep SIP support: SIP trunking + registration interop so agents can operate inside PBX/SBC/carrier environments
Channel portability: reuse the same AI pipeline across PSTN numbers, SIP, WebRTC, mobile SDKs, and WhatsApp calling
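The failover and parallel-execution primitives above can be sketched in plain JavaScript. The helpers below (withTimeout, withFailover, runInParallel) are hypothetical and not part of any Voximplant API; they only illustrate the control flow, assuming each provider is wrapped as an async function.

```javascript
// Hypothetical helpers; not part of any Voximplant API.

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try each provider in order; move on when one throws or exceeds the timeout.
async function withFailover(providers, input, timeoutMs) {
  const errors = [];
  for (const { name, run } of providers) {
    try {
      const result = await withTimeout(run(input), timeoutMs);
      return { provider: name, result };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}

// Run independent steps (for example, intent extraction and generation) concurrently.
function runInParallel(steps, input) {
  return Promise.all(steps.map((step) => step(input)));
}
```

A timed-out provider is not cancelled here; its result is simply ignored. Whether a real provider call can be aborted depends on the client you use.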

Real-time media integration (streaming)

WebSocket-based media streaming for connecting calls to real-time AI systems and custom pipelines (audio + metadata/control messages on the same channel)
Media gateway abstraction: avoid building/operating custom streaming gateways when using native connectors/modules

Voice telephony

Connectivity and endpoints

PSTN calling (inbound/outbound) via phone numbers and programmable call handling
Phone numbers API: automated procurement in 60+ countries (availability varies by country)
SIP calling and trunking: connect carriers / PBXs / SBCs using SIP interop (including registration-based scenarios)
WebRTC calling via web/mobile SDKs (VoIP calling in apps and browsers)
WhatsApp calling: inbound/outbound voice calls via WhatsApp Business API integration

Serverless call control (VoxEngine)

JavaScript call logic (no XML) for real-time call routing and application workflows
Per-call-leg signaling and media control: each leg of a call can be controlled independently

Conferencing and bridging

Single conferencing API for voice/video; mix PSTN, SIP, WebRTC, and native mobile endpoints
Conferences of up to 50 participants

Recording, transcription, and speech processing

Call recording via call.record() in scenarios (supports stereo and additional options)
Call transcription via call.record({ transcribe: true }) and retrieval via GetCallHistory (transcripts are delivered asynchronously)
Speaker/channel labeling in transcripts (for example, “Left”/“Right” labeling pattern described in docs)
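As an illustration of the recording items above, the following sketch shows how a scenario might start a stereo recording with transcription once a call connects. Inside the platform, VoxEngine, AppEvents, and CallEvents are globals; here they are passed in as arguments (an assumption made so the sketch can run against stubs). The option names stereo and transcribe and the RecordStarted event are written from memory of the VoxEngine reference and should be checked there; registerRecordingScenario and onRecordUrl are hypothetical names.

```javascript
// Sketch of a VoxEngine-style scenario. The platform objects are injected so the
// same function can be exercised with stubs outside the platform (an assumption
// of this sketch, not how scenarios are normally written).
function registerRecordingScenario({ VoxEngine, AppEvents, CallEvents }, onRecordUrl) {
  VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
    const call = e.call;
    call.addEventListener(CallEvents.Connected, () => {
      // Assumed option names: stereo recording plus asynchronous transcription.
      call.record({ stereo: true, transcribe: true });
    });
    // Assumed event: reports where the recording will be stored.
    call.addEventListener(CallEvents.RecordStarted, (ev) => onRecordUrl(ev.url));
    call.answer();
  });
}
```

The transcript itself is not available in the scenario; as noted above, it is delivered asynchronously and fetched later through GetCallHistory.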

Speech-to-Text (ASR) modes and features

Phrase-hint mode (best for constrained dialogs / IVRs) and Freeform mode (open transcription)
Multiple ASR engines (for example, Google, Amazon, Microsoft, Yandex, T-Bank) with selectable profiles
Intermediate results support (provider-dependent) for faster partial recognition
Google Speech v1p1beta1 feature passthrough (for example, word time offsets, punctuation, diarization config)

Answering machine / voicemail / beep detection

AMD module for voicemail/answering machine detection in scenarios
Beep detection with specified frequency lists and timeouts (scenario-level control)
AMD event/callback model available in VoxEngine references

Automated outbound calling (call lists + dialing logic)

Call Lists: upload a CSV call list and process it with VoxEngine scenarios (campaign-style calling)
Management API CallLists: programmatic call-list upload/append with delimiter support
Predictive Dialing System (PDS): uses agent/load statistics and call-list progression to place calls and connect answered calls to agents
Predictive and progressive dialing modes with tunable parameters (for example, allowed failed call percentage)
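For the call-list items above, a small helper can produce the delimiter-separated text that a CSV upload expects. This is a sketch under assumptions: buildCallListCsv is a hypothetical name, the phone and name columns are made up for illustration, and the semicolon default is a guess that should be checked against the Management API docs. Because quoting rules are not described here, the helper rejects values that contain the delimiter or a line break instead of quoting them.

```javascript
// Hypothetical helper: serialize rows into delimiter-separated call-list text.
// The first line is a header row taken from the keys of the first row.
function buildCallListCsv(rows, delimiter = ";") {
  if (rows.length === 0) throw new Error("call list is empty");
  const columns = Object.keys(rows[0]);
  const checkField = (value) => {
    const text = String(value);
    // Quoting support is unknown, so refuse values that would break the layout.
    if (text.includes(delimiter) || /[\r\n]/.test(text)) {
      throw new Error(`field contains delimiter or line break: ${text}`);
    }
    return text;
  };
  const lines = [columns.map(checkField).join(delimiter)];
  for (const row of rows) {
    lines.push(columns.map((col) => checkField(row[col] ?? "")).join(delimiter));
  }
  return lines.join("\n");
}

const csv = buildCallListCsv([
  { phone: "15550100", name: "Ada" },
  { phone: "15550101", name: "Lin Wei" },
]);
```

Each data row then becomes one entry that a VoxEngine scenario processes during the campaign.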

Tools and Developer Experience

Cloud IDE and debugging

Cloud IDE + debugger in the control panel:
Code verification
Autocompletion
Diff highlighting
Built-in troubleshooting workflow

Local IDE and continuous integration

CLI tool for CI/CD automation so you can use your own IDE
Type library for local development with autocompletion and type checking

SDKs and client libraries

SDKs: iOS, Android, Web, React Native, Flutter, Unity
API clients: curl, Node.js, Python, PHP, Go, .NET, Java

Management API (HTTP)

Control accounts/services programmatically (examples from docs include managing phone numbers, messaging, billing, logs, records, and user access)

Real-time Media Streaming (WebSockets / Media Streams)

Media Streams: integrate live audio streams into calls via WebSockets for real-time transcription/analysis and AI integrations
WebSocket programming model in VoxEngine:
Create connections via VoxEngine.createWebSocket(...)
Stream audio using WebSocket.sendMediaTo(...)
Recommended audio chunk duration: ~20ms
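The ~20 ms recommendation translates into a byte count once an audio format is fixed. The sketch below assumes uncompressed 16-bit mono PCM at 8 kHz and 16 kHz purely for illustration; the encoding and sample rate actually used on a Voximplant WebSocket are not stated here and should be taken from the docs. pcmChunkBytes and splitIntoChunks are hypothetical helper names.

```javascript
// Bytes in one chunk of uncompressed PCM audio.
function pcmChunkBytes({ sampleRateHz, bitsPerSample, channels, chunkMs }) {
  const samples = (sampleRateHz * chunkMs) / 1000;
  return samples * channels * (bitsPerSample / 8);
}

// Split a byte buffer into fixed-size chunks; the last chunk may be shorter.
function splitIntoChunks(buffer, chunkBytes) {
  const chunks = [];
  for (let offset = 0; offset < buffer.length; offset += chunkBytes) {
    chunks.push(buffer.subarray(offset, offset + chunkBytes));
  }
  return chunks;
}

// Assumed formats, for illustration only.
const narrowband = pcmChunkBytes({ sampleRateHz: 8000, bitsPerSample: 16, channels: 1, chunkMs: 20 });  // 320 bytes
const wideband = pcmChunkBytes({ sampleRateHz: 16000, bitsPerSample: 16, channels: 1, chunkMs: 20 });   // 640 bytes
```

Compressed codecs produce different sizes for the same 20 ms, so the arithmetic only applies to raw PCM.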

Network, Reliability, and Deployment

Serverless runtime: no infrastructure to manage for call logic
CI/CD-friendly deployment path: CLI tool for automated pipelines
Global footprint: datacenters in 14 countries
Status page for live and historical uptime of subcomponents
