Voximplant Platform Capabilities
Voximplant Platform is a cloud communications platform for building programmable voice, video, and messaging applications using serverless call control, SDKs, and APIs.
Use this page as a capability map: scan the summary cards first, then expand only the detailed sections you care about.
Capabilities at a glance
Connect real-time AI agents, speech systems, and telephony channels with code-driven orchestration.
Run inbound/outbound PSTN, SIP, WebRTC, and WhatsApp voice flows with fine-grained call control.
Build WebRTC video experiences with recording, screen sharing, and voice/video interoperability.
Deliver SMS, in-app messaging, push notifications, and webhook-driven backend integrations.
Use cloud IDE/debugging, multi-platform SDKs, and Management API automation.
Stream media over WebSockets and deploy globally on serverless infrastructure.
Voice AI vendors at a glance
Direct agent and real-time connectors
- OpenAI: Realtime and agent-style voice integrations.
- Google Gemini: Live speech interactions with Gemini APIs.
- Deepgram: Native voice-agent connector and examples.
- ElevenLabs: Conversational AI agent integrations.
- Ultravox: WebSocket-based speech-native connector.
- Cartesia: Line Agents runtime with VoxEngine orchestration.
- xAI: Grok voice-agent flow and feature support.
Speech and realtime TTS options

- Cartesia: Realtime TTS pattern for half-cascade voice pipelines. API refs: Cartesia module.
- Inworld: Realtime TTS option for half-cascade voice flows.
- ElevenLabs: Streaming/realtime TTS option for voice AI pipelines. API refs: ElevenLabs module.
Detailed capabilities
Voice AI Orchestration
Voximplant AI is a serverless runtime for Voice AI pipelines. It connects real-time agent/LLM systems and speech engines to PSTN, SIP, WebRTC, mobile, and WhatsApp calling, with code-driven orchestration and a choice of providers. See Voximplant AI and the Voice AI connectors section of the docs.
Supported vendors (direct agent / real-time LLM connectors)
Native/direct connectivity is positioned for:
- OpenAI (Realtime / agent-style integrations) - Docs: OpenAI
- Google Gemini (Live) - Docs: Google
- Deepgram Voice Agent - Docs: Deepgram
- ElevenLabs Agents / Conversational AI - Docs: ElevenLabs
- Ultravox (WebSocket API) - Docs: Ultravox
- Cartesia Line Agents - Docs: Cartesia Line Agents
- xAI (Grok Voice Agent) - Docs: xAI
Beyond the vendors above, Voximplant AI can also connect to other WebSocket interfaces, which covers real-time AI systems that are not listed here.
Supported vendors (speech engines: STT / TTS)
Voximplant’s platform speech layer (STT/TTS) includes built-in providers such as:
- Speech-to-Text (STT): Google Speech Cloud, Microsoft Azure STT, Amazon Transcribe, Yandex Speech Cloud
- Text-to-Speech (TTS): Google Speech Cloud, Amazon Polly, Yandex Speech Cloud, Microsoft Azure TTS, Tinkoff VoiceKit
For realtime / streaming TTS used in Voice AI scenarios, Voximplant also provides native VoxEngine modules and guides for:
- Cartesia Realtime TTS - Guide: Realtime TTS and API refs
- Inworld Realtime TTS - Guide: Realtime TTS
- ElevenLabs Streaming / realtime TTS - Guide: ElevenLabs TTS and API refs
Pipeline options (architectures you can run)
- Speech-to-speech: real-time audio in and real-time audio out (the agent API handles the full-duplex loop)
- Speech -> LLM -> TTS: stream audio directly into a speech LLM and use a separate TTS engine for output
- STT -> LLM -> TTS: stream audio to STT, pass text to an LLM/toolchain, synthesize response audio
- Hybrid: combine a real-time agent API for turn-taking with separate best-of-breed STT/TTS components (mix and match)
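The cascaded STT -> LLM -> TTS option above can be pictured as three swappable async stages. The following is a minimal sketch in plain JavaScript, not platform code: `createPipeline` and the stub stages are hypothetical names, and real stages would wrap vendor connectors and stream audio rather than process whole turns.

```javascript
// Hypothetical sketch: a cascaded voice pipeline as three swappable async stages.
// None of these names are Voximplant APIs.
function createPipeline({ stt, llm, tts }) {
  // Each stage is an async function; swapping a vendor means swapping one stage.
  return async function handleTurn(audioIn) {
    const userText = await stt(audioIn);   // speech -> text
    const replyText = await llm(userText); // text -> text
    const audioOut = await tts(replyText); // text -> speech
    return { userText, replyText, audioOut };
  };
}

// Stub stages standing in for real providers (illustration only).
const fakeStt = async (audio) => audio.transcript;
const fakeLlm = async (text) => `You said: ${text}`;
const fakeTts = async (text) => ({ format: 'pcm16', text });

const handleTurn = createPipeline({ stt: fakeStt, llm: fakeLlm, tts: fakeTts });
```

Calling `handleTurn({ transcript: 'hello' })` resolves to an object holding the recognized text, the reply text, and the stubbed audio. Replacing `fakeTts` with another function changes the output voice without touching the other two stages.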
Orchestration primitives (what you control)
- Mix and match providers: swap STT/TTS/LLM vendors without changing your telephony integration
- Parallel model execution: run multiple speech/LLM components in parallel when useful (for example, intent extraction + generation)
- Failover paths: fall back to alternate speech/LLM providers when a step errors or times out
- Wideband audio: higher fidelity audio path for improved user experience and model comprehension
- Deep SIP support: SIP trunking + registration interop so agents can operate inside PBX/SBC/carrier environments
- Channel portability: reuse the same AI pipeline across PSTN numbers, SIP, WebRTC, mobile SDKs, and WhatsApp calling
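Two of the primitives above, failover paths and parallel execution, can be sketched with ordinary promises. This is an illustrative sketch under stated assumptions: `withTimeout`, `callWithFailover`, and `runInParallel` are hypothetical helpers, and the provider functions passed to them stand in for real STT/TTS/LLM calls.

```javascript
// Hypothetical helpers; provider functions are stand-ins, not Voximplant APIs.

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try providers in order; move on when one throws or exceeds the timeout.
async function callWithFailover(providers, input, timeoutMs) {
  const errors = [];
  for (const { name, run } of providers) {
    try {
      const result = await withTimeout(run(input), timeoutMs);
      return { provider: name, result };
    } catch (err) {
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`all providers failed (${errors.join('; ')})`);
}

// Run independent steps (for example, intent extraction and generation) in parallel.
async function runInParallel(steps, input) {
  const names = Object.keys(steps);
  const results = await Promise.all(names.map((n) => steps[n](input)));
  return Object.fromEntries(names.map((n, i) => [n, results[i]]));
}
```

The timeout only stops waiting for a slow provider; it does not cancel the underlying request, so a production version would also need a way to abort it.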
Real-time media integration (streaming)
- WebSocket-based media streaming for connecting calls to real-time AI systems and custom pipelines (audio + metadata/control messages on the same channel)
- Media gateway abstraction: avoid building/operating custom streaming gateways when using native connectors/modules
Voice telephony
Connectivity and endpoints
- PSTN calling (inbound/outbound) via phone numbers and programmable call handling
- Phone numbers API: automated procurement in 60+ countries (availability varies by country)
- SIP calling and trunking: connect carriers / PBXs / SBCs using SIP interop (including registration-based scenarios)
- WebRTC calling via web/mobile SDKs (VoIP calling in apps and browsers)
- WhatsApp calling: inbound/outbound voice calls via WhatsApp Business API integration
Serverless call control (VoxEngine)
- JavaScript call logic (no XML) for real-time call routing and application workflows
- Per-call-leg signaling/media control - granular control over each leg independently
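As a rough illustration of JavaScript call logic, the sketch below keeps the routing decision in a pure function and wires it into a scenario only where the platform globals exist. The routing table, `pickRoute`, and the phone numbers are hypothetical. The VoxEngine calls (`AppEvents.CallAlerting`, `VoxEngine.callPSTN`, `VoxEngine.easyProcess`) reflect a common scenario pattern and should be checked against the current VoxEngine reference before use.

```javascript
// Pure routing decision, kept separate so it can be tested outside the platform.
// The routing table and its shape are hypothetical.
function pickRoute(dialedNumber, routes, fallback) {
  // Longest matching prefix wins.
  let best = null;
  for (const route of routes) {
    if (dialedNumber.startsWith(route.prefix) &&
        (best === null || route.prefix.length > best.prefix.length)) {
      best = route;
    }
  }
  return best ? best.target : fallback;
}

const ROUTES = [
  { prefix: '1', target: '+15550100' },    // placeholder number
  { prefix: '1800', target: '+15550199' }, // placeholder number
];

// Scenario wiring; skipped when the VoxEngine globals are not defined.
if (typeof VoxEngine !== 'undefined') {
  VoxEngine.addEventListener(AppEvents.CallAlerting, (e) => {
    const target = pickRoute(e.destination, ROUTES, '+15550123');
    // Assumption: this caller ID is permitted for outbound PSTN calls on the account.
    const outbound = VoxEngine.callPSTN(target, e.callerid);
    // Assumption: easyProcess bridges the two legs and handles basic call events.
    VoxEngine.easyProcess(e.call, outbound);
  });
}
```

Keeping `pickRoute` free of platform calls means the routing rules can be unit-tested locally, while each call leg is still handled individually inside the event handler.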
Conferencing and bridging
- Single conferencing API for voice/video; mix PSTN, SIP, WebRTC, and native mobile endpoints
- Conferences up to 50 participants
Recording, transcription, and speech processing
- Call recording via call.record() in scenarios (supports stereo and additional options)
- Call transcription via record(transcribe=true) and retrieval via GetCallHistory (transcription delivered asynchronously)
- Speaker/channel labeling in transcripts (for example, the “Left”/“Right” labeling pattern described in the docs)
Speech-to-Text (ASR) modes and features
- Phrase-hint mode (best for constrained dialogs / IVRs) and Freeform mode (open transcription)
- Multiple ASR engines (for example, Google, Amazon, Microsoft, Yandex, T-Bank) with selectable profiles
- Intermediate results support (provider-dependent) for faster partial recognition
- Google Speech v1p1beta1 feature passthrough (for example, word time offsets, punctuation, diarization config)
Answering machine / voicemail / beep detection
- AMD module for voicemail/answering machine detection in scenarios
- Beep detection with specified frequency lists and timeouts (scenario-level control)
- AMD event/callback model available in VoxEngine references
Automated outbound calling (call lists + dialing logic)
- Call Lists: upload a CSV call list and process it with VoxEngine scenarios (campaign-style calling)
- Management API CallLists: programmatic call-list upload/append with delimiter support
- Predictive Dialing System (PDS): uses agent/load statistics and call-list progression to place calls and connect answered calls to agents
- Predictive and progressive dialing modes with tunable parameters (for example, allowed failed call percentage)
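A call list is delimited text, so preparing one for upload is mostly a serialization step. The sketch below shows one way to build it. `buildCallListCsv`, the column names, and the semicolon default are assumptions for illustration; the expected columns and delimiter should be taken from the Call Lists documentation.

```javascript
// Hypothetical helper: serialize call-list rows to delimited text with a header row.
function buildCallListCsv(rows, columns, delimiter = ';') {
  const escape = (value) => {
    const text = value == null ? '' : String(value);
    // Quote fields containing the delimiter, quotes, or line breaks.
    if (text.includes(delimiter) || /["\r\n]/.test(text)) {
      return `"${text.replace(/"/g, '""')}"`;
    }
    return text;
  };
  const header = columns.map(escape).join(delimiter);
  const lines = rows.map((row) => columns.map((c) => escape(row[c])).join(delimiter));
  return [header, ...lines].join('\n');
}

// Example rows with placeholder data.
const csv = buildCallListCsv(
  [{ phone: '15550100', name: 'Ann; Lee' }],
  ['phone', 'name']
);
```

Each row then becomes one task that a scenario can process, with the column values available as per-call data.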
Video telephony
WebRTC video API (server-based + P2P)
- Video API to build server-based and P2P video experiences
- SDKs abstract core WebRTC complexities:
  - STUN/TURN/ICE
  - Bandwidth optimization
  - Video quality control
Real-time collaboration features
- Screen sharing (share screen or window)
- Recording for calls/conferences; storage in Voximplant Cloud or S3-compatible storage
- Video streaming support (platform capability referenced in docs/features)
Voice/video interoperability
- Bridge PSTN/SIP audio into video rooms as part of a unified conferencing model
Messaging
SMS
- Send SMS via Management API and receive inbound SMS via HTTP callbacks (for SMS-capable numbers)
Instant Messaging (in-app chat)
- Direct messaging between application users
- Chat rooms up to 1000 participants
- Chatbots for automated interactions
Push notifications (mobile)
- Push notifications to wake devices for incoming calls and message notifications
- Android push implementation is based on Firebase Cloud Messaging (FCM)
Webhooks / event delivery to your backend
- HTTP Callbacks for event-driven notifications without polling the Management API
Tools and Developer Experience
Cloud IDE and debugging
- Cloud IDE + debugger in the control panel:
  - Code verification
  - Autocompletion
  - Diff highlighting
  - Built-in troubleshooting workflow
SDKs and client libraries
- SDKs: iOS, Android, Web, React Native, Flutter, Unity
- API clients: curl, Node.js, Python, PHP, Go, .NET, Java
Management API (HTTP)
- Control accounts/services programmatically (examples from docs include managing phone numbers, messaging, billing, logs, records, and user access)
Real-time Media Streaming (WebSockets / Media Streams)
- Media Streams: integrate live audio streams into calls via WebSockets for real-time transcription/analysis and AI integrations
- WebSocket programming model in VoxEngine:
  - Create connections via VoxEngine.createWebSocket(...)
  - Stream audio using WebSocket.sendMediaTo(...)
  - Recommended audio chunk duration: ~20ms
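To make the ~20 ms chunk recommendation concrete, the sketch below computes the frame size for raw PCM and splits a buffer into frames of that size. It assumes uncompressed 16-bit mono PCM, which may not match the encoding configured for a given stream; `frameSizeBytes` and `splitIntoFrames` are hypothetical helper names.

```javascript
// Bytes in one frame of raw (uncompressed) PCM audio.
function frameSizeBytes(sampleRateHz, bytesPerSample, channels, frameMs) {
  return (sampleRateHz * bytesPerSample * channels * frameMs) / 1000;
}

// Split a PCM buffer into fixed-size frames; the last frame may be shorter.
function splitIntoFrames(pcm, frameBytes) {
  const frames = [];
  for (let offset = 0; offset < pcm.length; offset += frameBytes) {
    frames.push(pcm.subarray(offset, offset + frameBytes));
  }
  return frames;
}

// 20 ms of 8 kHz, 16-bit, mono PCM: 8000 * 2 * 1 * 20 / 1000 = 320 bytes.
const narrowbandFrame = frameSizeBytes(8000, 2, 1, 20);
// The same duration at 16 kHz doubles to 640 bytes.
const widebandFrame = frameSizeBytes(16000, 2, 1, 20);
```

Compressed codecs produce different byte counts for the same duration, so for those the frame size should come from the codec's packetization rather than from this arithmetic.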
Network, Reliability, and Deployment
- Serverless runtime (no infrastructure to manage for call logic)
- Global footprint: datacenters in 14 distinct countries (as stated on the platform page)
- Status page for live and historical uptime of subcomponents