Voximplant Platform Capabilities

Voximplant Platform is a cloud communications platform for building programmable voice, video, and messaging applications using serverless call control, SDKs, and APIs.

Use this page as a capability map: scan the summary cards first, then expand only the detailed sections you care about.

Capabilities at a glance

Voice AI orchestration

Connect real-time AI agents, speech systems, and telephony channels with code-driven orchestration.

Voice telephony

Run inbound/outbound PSTN, SIP, WebRTC, and WhatsApp voice flows with fine-grained call control.

Video telephony

Build WebRTC video experiences with recording, screen sharing, and voice/video interoperability.

Messaging

Deliver SMS, in-app messaging, push notifications, and webhook-driven backend integrations.

Tools and DX

Use cloud IDE/debugging, multi-platform SDKs, and Management API automation.

Media streaming + reliability

Stream media over WebSockets and deploy globally on serverless infrastructure.

Voice AI vendors at a glance

Direct agent and real-time connectors

OpenAI Realtime

Realtime and agent-style voice integrations.

Google Gemini Live

Live speech interactions with Gemini APIs.

Deepgram Voice Agent

Native voice-agent connector and examples.

ElevenLabs Agents

Conversational AI agent integrations.

Ultravox

WebSocket-based speech-native connector.

Cartesia Line Agents

Line Agents runtime with VoxEngine orchestration.

xAI Grok Voice Agent

Grok voice-agent flow and feature support.

Speech and realtime TTS options

Cartesia Realtime TTS

Realtime TTS pattern for half-cascade voice pipelines.

API refs: Cartesia module.

Inworld Realtime TTS

Realtime TTS option for half-cascade voice flows.

ElevenLabs Streaming TTS

Streaming/realtime TTS option for voice AI pipelines.

API refs: ElevenLabs module.

Detailed capabilities

Voice AI Orchestration

Voximplant AI is a serverless runtime for Voice AI pipelines. It connects real-time agent/LLM systems and speech engines to PSTN, SIP, WebRTC, mobile, and WhatsApp calling, with code-driven orchestration and a choice of providers. See Voximplant AI and the Voice AI connectors section of the docs.

Supported vendors (direct agent / real-time LLM connectors)

Native (direct) connectors are documented for:

OpenAI (Realtime / agent-style integrations) - Docs: OpenAI
Google Gemini (Live) - Docs: Google
Deepgram Voice Agent - Docs: Deepgram
ElevenLabs Agents / Conversational AI - Docs: ElevenLabs
Ultravox (WebSocket API) - Docs: Ultravox
Cartesia Line Agents - Docs: Cartesia Line Agents
xAI (Grok Voice Agent) - Docs: xAI

In addition to the vendors above, Voximplant AI supports connecting to other WebSocket interfaces, so other real-time AI systems can be integrated.

Supported vendors (speech engines: STT / TTS)

Voximplant’s platform speech layer (STT/TTS) includes built-in providers such as:

Speech-to-Text (STT): Google Speech Cloud, Microsoft Azure STT, Amazon Transcribe, Yandex Speech Cloud
Text-to-Speech (TTS): Google Speech Cloud, Amazon Polly, Yandex Speech Cloud, Microsoft Azure TTS, Tinkoff VoiceKit

For realtime / streaming TTS used in Voice AI scenarios, Voximplant also provides native VoxEngine modules and guides for:

Cartesia Realtime TTS - Guide: Realtime TTS and API refs
Inworld Realtime TTS - Guide: Realtime TTS
ElevenLabs Streaming / realtime TTS - Guide: ElevenLabs TTS and API refs

Pipeline options (architectures you can run)

Speech-to-speech: real-time audio in and real-time audio out (the agent API handles the full-duplex loop)
Speech -> LLM -> TTS: stream audio directly into a speech LLM and use a different TTS for output
STT -> LLM -> TTS: stream audio to STT, pass text to an LLM/toolchain, synthesize response audio
Hybrid: combine a real-time agent API for turn-taking with separate best-of-breed STT/TTS components (mix and match)
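
As a rough illustration of the STT -> LLM -> TTS option above, the sketch below chains three swappable stages in plain JavaScript. `runCascade`, `stubStages`, and the stage signatures are hypothetical names made up for this example; they are not VoxEngine APIs, and real stages would wrap the provider modules.

```javascript
// Hypothetical cascade runner: each stage is an async function supplied by the caller.
async function runCascade(stages, audioIn) {
  const transcript = await stages.stt(audioIn); // audio -> text
  const reply = await stages.llm(transcript);   // text -> text
  const audioOut = await stages.tts(reply);     // text -> audio (stubbed here)
  return { transcript, reply, audioOut };
}

// Stub stages standing in for real providers (assumptions for illustration only).
const stubStages = {
  stt: async (audio) => `heard ${audio.length} bytes`,
  llm: async (text) => `reply to: ${text}`,
  tts: async (text) => ({ format: "stub", text }),
};

runCascade(stubStages, new Uint8Array(320)).then((out) => {
  console.log(out.reply); // reply to: heard 320 bytes
});
```

Because the runner only depends on the three function signatures, a different vendor can be tried by replacing one property, for example `{ ...stubStages, tts: otherTts }`.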

Orchestration primitives (what you control)

Mix and match providers: swap STT/TTS/LLM vendors without changing your telephony integration
Parallel model execution: run multiple speech/LLM components in parallel when useful (for example, intent extraction + generation)
Failover paths: fall back to alternate speech/LLM providers when a step errors or times out
Wideband audio: higher fidelity audio path for improved user experience and model comprehension
Deep SIP support: SIP trunking + registration interop so agents can operate inside PBX/SBC/carrier environments
Channel portability: reuse the same AI pipeline across PSTN numbers, SIP, WebRTC, mobile SDKs, and WhatsApp calling
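
The failover idea in the list above can be sketched in plain JavaScript as an ordered list of providers with a per-step timeout. `withTimeout`, `callWithFailover`, and the `{ name, run }` provider shape are hypothetical helpers for illustration, not part of VoxEngine; how a real scenario detects errors depends on the module in use.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it does not keep the runtime busy.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try each provider in order and return the first successful result.
async function callWithFailover(providers, input, timeoutMs) {
  const errors = [];
  for (const provider of providers) {
    try {
      const result = await withTimeout(provider.run(input), timeoutMs);
      return { provider: provider.name, result };
    } catch (err) {
      // Record the failure and fall through to the next provider.
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}
```

The same provider shape also works for parallel execution: independent steps can be started together with `Promise.all` and joined before the response is synthesized.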

Real-time media integration (streaming)

WebSocket-based media streaming for connecting calls to real-time AI systems and custom pipelines (audio + metadata/control messages on the same channel)
Media gateway abstraction: avoid building/operating custom streaming gateways when using native connectors/modules

Voice telephony

Connectivity and endpoints

PSTN calling (inbound/outbound) via phone numbers and programmable call handling
Phone numbers API: automated procurement in 60+ countries (availability varies by country)
SIP calling and trunking: connect carriers / PBXs / SBCs using SIP interop (including registration-based scenarios)
WebRTC calling via web/mobile SDKs (VoIP calling in apps and browsers)
WhatsApp calling: inbound/outbound voice calls via WhatsApp Business API integration

Serverless call control (VoxEngine)

JavaScript call logic (no XML) for real-time call routing and application workflows
Per-call-leg signaling and media control: each leg can be controlled independently

Conferencing and bridging

Single conferencing API for voice/video; mix PSTN, SIP, WebRTC, and native mobile endpoints
Conferences with up to 50 participants

Recording, transcription, and speech processing

Call recording via call.record() in scenarios (supports stereo and additional options)
Call transcription via call.record({ transcribe: true }) and retrieval via GetCallHistory (transcripts are delivered asynchronously)
Speaker/channel labeling in transcripts (for example, “Left”/“Right” labeling pattern described in docs)

Speech-to-Text (ASR) modes and features

Phrase-hint mode (best for constrained dialogs / IVRs) and Freeform mode (open transcription)
Multiple ASR engines (for example, Google, Amazon, Microsoft, Yandex, T-Bank) with selectable profiles
Intermediate results support (provider-dependent) for faster partial recognition
Google Speech v1p1beta1 feature passthrough (for example, word time offsets, punctuation, diarization config)

Answering machine / voicemail / beep detection

AMD module for voicemail/answering machine detection in scenarios
Beep detection with specified frequency lists and timeouts (scenario-level control)
AMD event/callback model available in VoxEngine references
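
To show what matching audio against a frequency list can look like, here is a small sketch using the Goertzel algorithm, a standard single-frequency detector. It is not Voximplant's beep detector and does not model timeouts; `goertzelPower`, `detectTones`, and the 0.5 threshold are assumptions chosen for this example.

```javascript
// Goertzel algorithm: normalized signal power at the DFT bin nearest to targetHz.
function goertzelPower(samples, sampleRateHz, targetHz) {
  const n = samples.length;
  const k = Math.round((n * targetHz) / sampleRateHz); // nearest DFT bin
  const coeff = 2 * Math.cos((2 * Math.PI * k) / n);
  let s1 = 0;
  let s2 = 0;
  for (const x of samples) {
    const s0 = x + coeff * s1 - s2;
    s2 = s1;
    s1 = s0;
  }
  // Squared magnitude of bin k, scaled so a full-scale sine on the bin gives about 1.
  const power = s1 * s1 + s2 * s2 - coeff * s1 * s2;
  return power / ((n / 2) * (n / 2));
}

// Return the listed frequencies whose normalized power exceeds the threshold.
function detectTones(samples, sampleRateHz, frequenciesHz, threshold = 0.5) {
  return frequenciesHz.filter(
    (f) => goertzelPower(samples, sampleRateHz, f) > threshold
  );
}

// 20 ms of a 1000 Hz sine sampled at 8 kHz (160 samples).
const tone = Array.from({ length: 160 }, (_, i) =>
  Math.sin((2 * Math.PI * 1000 * i) / 8000)
);
console.log(detectTones(tone, 8000, [440, 1000, 2000])); // [1000]
```

Frequency resolution is the sample rate divided by the window length (50 Hz here), so closely spaced list entries need a longer window to be told apart.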

Automated outbound calling (call lists + dialing logic)

Call Lists: upload a CSV call list and process it with VoxEngine scenarios (campaign-style calling)
Management API CallLists: programmatic call-list upload/append with delimiter support
Predictive Dialing System (PDS): uses agent/load statistics and call-list progression to place calls and connect answered calls to agents
Predictive and progressive dialing modes with tunable parameters (for example, allowed failed call percentage)
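
A call list is ultimately delimiter-separated text, so a backend can generate it before upload. The sketch below is one way to do that; the column names (`phone_number`, `customer_name`), the default `;` delimiter, and the quoting rules are assumptions for illustration, so check the Call Lists documentation for the columns and escaping the platform actually expects.

```javascript
// Serialize rows into delimiter-separated text with a header line.
function toCallListCsv(rows, columns, delimiter = ";") {
  const escapeField = (value) => {
    const text = value == null ? "" : String(value);
    // Quote fields that contain the delimiter, a double quote, or a line break.
    const needsQuotes =
      text.includes(delimiter) || text.includes('"') || /[\r\n]/.test(text);
    return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
  };
  const lines = [columns.map(escapeField).join(delimiter)];
  for (const row of rows) {
    lines.push(columns.map((column) => escapeField(row[column])).join(delimiter));
  }
  return lines.join("\n");
}

const csv = toCallListCsv(
  [
    { phone_number: "+15550100", customer_name: "Ada" },
    { phone_number: "+15550101", customer_name: "Smith; John" },
  ],
  ["phone_number", "customer_name"]
);
console.log(csv);
```

Keeping the delimiter as a parameter means the same helper can target whichever separator is declared when the list is uploaded.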

Video telephony

WebRTC video API (server-based + P2P)

Video API to build server-based and P2P video experiences
SDKs abstract core WebRTC complexities:
- STUN/TURN/ICE
- Bandwidth optimization
- Video quality control

Real-time collaboration features

Screen sharing (share screen or window)
Recording for calls/conferences; storage in Voximplant Cloud or S3-compatible storage
Video streaming support (platform capability referenced in docs/features)

Voice/video interoperability

Bridge PSTN/SIP audio into video rooms as part of a unified conferencing model

Messaging

SMS

Send SMS via Management API and receive inbound SMS via HTTP callbacks (for SMS-capable numbers)

Instant Messaging (in-app chat)

Direct messaging between application users
Chat rooms with up to 1000 participants
Chatbots for automated interactions

Push notifications (mobile)

Push notifications to wake devices for incoming calls and message notifications
Android push implementation is based on Firebase Cloud Messaging (FCM)

Webhooks / event delivery to your backend

HTTP Callbacks for event-driven notifications without polling the Management API

Tools and Developer Experience

Cloud IDE and debugging

Cloud IDE + debugger in the control panel:
- Code verification
- Autocompletion
- Diff highlighting
- Built-in troubleshooting workflow

SDKs and client libraries

SDKs: iOS, Android, Web, React Native, Flutter, Unity
API clients: curl, Node.js, Python, PHP, Go, .NET, Java

Management API (HTTP)

Control accounts/services programmatically (examples from docs include managing phone numbers, messaging, billing, logs, records, and user access)

Real-time Media Streaming (WebSockets / Media Streams)

Media Streams: integrate live audio streams into calls via WebSockets for real-time transcription/analysis and AI integrations
WebSocket programming model in VoxEngine:
- Create connections via VoxEngine.createWebSocket(...)
- Stream audio using WebSocket.sendMediaTo(...)
- Recommended audio chunk duration: ~20 ms
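
To make the ~20 ms recommendation concrete, the sketch below computes how many bytes one chunk holds and splits a buffer accordingly. The 16-bit mono PCM format and the 8 kHz / 16 kHz sample rates are assumptions for illustration; the encoding actually expected depends on the connector or WebSocket endpoint. `chunkBytes` and `splitIntoChunks` are hypothetical helpers, not VoxEngine APIs.

```javascript
// Bytes in one chunk of raw PCM audio of the given duration.
function chunkBytes(sampleRateHz, chunkMs = 20, bytesPerSample = 2, channels = 1) {
  const samplesPerChunk = Math.round((sampleRateHz * chunkMs) / 1000);
  return samplesPerChunk * bytesPerSample * channels;
}

// Split a byte buffer into fixed-duration chunks; the last one may be shorter.
function splitIntoChunks(pcm, sampleRateHz, chunkMs = 20) {
  const size = chunkBytes(sampleRateHz, chunkMs);
  const chunks = [];
  for (let offset = 0; offset < pcm.length; offset += size) {
    chunks.push(pcm.subarray(offset, offset + size));
  }
  return chunks;
}

console.log(chunkBytes(8000));  // 320
console.log(chunkBytes(16000)); // 640
```

Under these assumptions, one second of 8 kHz audio (16000 bytes) becomes 50 chunks of 320 bytes each.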

Network, Reliability, and Deployment

Serverless runtime (no infrastructure to manage for call logic)
Global footprint: datacenters in 14 distinct countries (as stated on the platform page)
Status page for live and historical uptime of subcomponents