Voximplant Platform Capabilities

Voximplant Platform is a cloud communications platform for building programmable voice, video, and messaging applications using serverless call control, SDKs, and APIs.

Voice AI Orchestration

Voximplant AI is a serverless runtime for Voice AI pipelines. It connects real-time agent/LLM systems and speech engines to PSTN, SIP, WebRTC, mobile, and WhatsApp calling, with orchestration written in code and a choice of providers at each stage. (See: Voximplant AI and the Voximplant docs Voice AI section.)

Supported vendors (direct agent / real-time LLM connectors)

Native/direct connectors are offered for:

OpenAI (Realtime / agent-style integrations) — Docs: OpenAI
Google Gemini (Live) — Docs: Google
Deepgram Voice Agent — Docs: Deepgram
ElevenLabs Agents / Conversational AI — Docs: ElevenLabs
Ultravox (WebSocket API) — Docs: Ultravox
xAI (Grok Voice Agent) — Docs: xAI

Beyond the vendors above, Voximplant AI can also connect to other real-time AI systems over a generic WebSocket interface.

Supported vendors (speech engines: STT / TTS)

Voximplant’s platform speech layer (STT/TTS) includes built-in providers such as:

Speech-to-Text (STT): Google Speech Cloud, Microsoft Azure STT, Amazon Transcribe, Yandex Speech Cloud
Text-to-Speech (TTS): Google Speech Cloud, Amazon Polly, Yandex Speech Cloud, Microsoft Azure TTS, Tinkoff VoiceKit

For realtime / streaming TTS used in Voice AI scenarios, Voximplant also provides native VoxEngine modules and guides for:

Cartesia Realtime TTS — Guide: Realtime TTS and API refs
Inworld Realtime TTS — Guide: Realtime TTS
ElevenLabs Streaming / realtime TTS — Guide: ElevenLabs TTS and API refs

Pipeline options (architectures you can run)

Speech-to-speech: real-time audio in ↔ real-time audio out (the agent API handles the full-duplex loop)
Speech → LLM → TTS: stream audio directly into a speech LLM and use a different TTS for output
STT → LLM → TTS: stream audio to STT, pass text to an LLM/toolchain, synthesize response audio
Hybrid: combine a real-time agent API for turn-taking with separate best-of-breed STT/TTS components (“mix & match”)
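
As a rough illustration of the STT → LLM → TTS option, the sketch below chains three stages as plain async functions. The names runCascade, fakeStt, fakeLlm, and fakeTts are hypothetical stand-ins and not Voximplant or vendor APIs; in a real scenario each stage would wrap a provider module.

```javascript
// Hypothetical stand-ins for real providers; each stage is just an async function.
const fakeStt = async (audio) => audio.transcript;            // audio -> text
const fakeLlm = async (text) => `You said: ${text}`;          // text -> reply text
const fakeTts = async (text) => ({ format: "pcm16", text });  // reply text -> "audio"

// Runs one conversational turn through the three stages in order.
async function runCascade(audio, { stt, llm, tts }) {
  const userText = await stt(audio);
  const replyText = await llm(userText);
  const replyAudio = await tts(replyText);
  return { userText, replyText, replyAudio };
}
```

Because the stages are passed in as arguments, swapping one vendor for another (the "mix & match" idea) only changes the object handed to runCascade, not the turn logic itself.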

Orchestration primitives (what you control)

Mix & match providers: swap STT/TTS/LLM vendors without changing your telephony integration
Parallel model execution: run multiple speech/LLM components in parallel when useful (e.g., intent extraction + generation)
Failover paths: fall back to alternate speech/LLM providers when a step errors or times out
Wideband audio: higher fidelity audio path for improved user experience and model comprehension
Deep SIP support: SIP trunking + registration interop so agents can operate inside PBX/SBC/carrier environments
Channel portability: reuse the same AI pipeline across PSTN numbers, SIP, WebRTC, mobile SDKs, and WhatsApp calling
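
The failover primitive can be pictured as "try providers in order, with a time limit on each". The following is a minimal sketch of that pattern in plain JavaScript; withTimeout, withFailover, and the provider objects are hypothetical helpers for illustration, not part of VoxEngine.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Tries each provider in order and returns the first successful result.
async function withFailover(providers, input, timeoutMs) {
  const errors = [];
  for (const { name, run } of providers) {
    try {
      const result = await withTimeout(run(input), timeoutMs);
      return { provider: name, result };
    } catch (err) {
      // Record why this provider was skipped, then move on to the next one.
      errors.push(`${name}: ${err.message}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}
```

A step that errors and a step that is merely slow are handled the same way here: both cause the next provider in the list to be tried. Whether a slow provider should also be cancelled is a separate design decision this sketch leaves out.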

Real-time media integration (streaming)

WebSocket-based media streaming for connecting calls to real-time AI systems and custom pipelines (audio + metadata/control messages on the same channel)
Media gateway abstraction: avoid building/operating custom streaming gateways when using native connectors/modules
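
To make "audio + metadata/control messages on the same channel" concrete, here is a sketch of a receiver that dispatches on a message type. The message shape (an event field with "start", "media", and "stop" values, and base64 audio in media.payload) is an assumption modeled on common telephony media-stream protocols; the actual format should be taken from the Voximplant WebSocket documentation. createStreamHandler is a hypothetical helper.

```javascript
// Assumed (hypothetical) message shape: { event: "start" | "media" | "stop", ... }
// with base64-encoded audio carried in media.payload.
function createStreamHandler() {
  const state = { started: false, stopped: false, audioBytes: 0, format: null };

  function onMessage(text) {
    const msg = JSON.parse(text);
    switch (msg.event) {
      case "start":
        // Control message: remember the announced media format, if any.
        state.started = true;
        state.format = msg.start?.mediaFormat ?? null;
        break;
      case "media": {
        // Audio message: decode the payload and count the bytes received.
        const audio = Buffer.from(msg.media.payload, "base64");
        state.audioBytes += audio.length;
        break;
      }
      case "stop":
        state.stopped = true;
        break;
      default:
        // Unknown control messages are ignored rather than treated as errors.
        break;
    }
    return state;
  }

  return { onMessage, state };
}
```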

Voice telephony

Connectivity and endpoints

PSTN calling (inbound/outbound) via phone numbers and programmable call handling
Phone numbers API: automated procurement in 60+ countries (availability varies by country)
SIP calling and trunking: connect carriers / PBXs / SBCs using SIP interop (including registration-based scenarios)
WebRTC calling via web/mobile SDKs (VoIP calling in apps and browsers)
WhatsApp calling: inbound/outbound voice calls via WhatsApp Business API integration

Serverless call control (VoxEngine)

JavaScript call logic (no XML) for real-time call routing and application workflows
Per-call-leg signaling/media control: granular control over each leg independently

Conferencing and bridging

Single conferencing API for voice/video; mix PSTN, SIP, WebRTC, and native mobile endpoints
Conferences up to 50 participants

Recording, transcription, and speech processing

Call recording via call.record() in scenarios (supports stereo and additional options)
Call transcription by passing transcribe: true in the call.record() parameters; transcripts are delivered asynchronously and retrieved via GetCallHistory
Speaker/channel labeling in transcripts (e.g., “Left”/“Right” labeling pattern described in docs)

Speech-to-Text (ASR) modes and features

Phrase-hint mode (best for constrained dialogs / IVRs) and Freeform mode (open transcription)
Multiple ASR engines (e.g., Google, Amazon, Microsoft, Yandex, T-Bank) with selectable profiles
Intermediate results support (provider-dependent) for faster partial recognition
Google Speech v1p1beta1 feature passthrough (e.g., word time offsets, punctuation, diarization config)

Answering machine / voicemail / beep detection

AMD module for voicemail/answering machine detection in scenarios
Beep detection with specified frequency lists and timeouts (scenario-level control)
AMD event/callback model available in VoxEngine references

Automated outbound calling (call lists + dialing logic)

Call Lists: upload a CSV call list and process it with VoxEngine scenarios (campaign-style calling)
Management API CallLists: programmatic call-list upload/append with delimiter support
Predictive Dialing System (PDS):
- Uses agent/load statistics and call-list progression to place calls and connect answered calls to agents
- Supports predictive and progressive modes with tunable parameters (e.g., allowed failed call %)
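
Since call lists are uploaded as delimiter-separated text, a small helper for producing that text may be useful. This is a sketch only: buildCallListCsv is a hypothetical function, and the column names and the ";" default delimiter are illustrative assumptions rather than documented requirements.

```javascript
// Serializes rows (plain objects) into delimiter-separated text with a header row.
// Column names are taken from the keys of the first row.
function buildCallListCsv(rows, delimiter = ";") {
  if (rows.length === 0) throw new Error("call list is empty");
  const columns = Object.keys(rows[0]);

  // Values are written unquoted, so anything that would break the layout is rejected.
  const cell = (value) => {
    const s = String(value ?? "");
    if (s.includes(delimiter) || /[\r\n]/.test(s)) {
      throw new Error(`value contains delimiter or line break: ${JSON.stringify(s)}`);
    }
    return s;
  };

  const lines = [columns.map(cell).join(delimiter)];
  for (const row of rows) {
    lines.push(columns.map((c) => cell(row[c])).join(delimiter));
  }
  return lines.join("\n");
}
```

Rejecting values that contain the delimiter is a deliberately conservative choice; it avoids guessing at quoting rules the upload endpoint may or may not accept.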

Video telephony

WebRTC video API (server-based + P2P)

Video API to build server-based and P2P video experiences
SDKs abstract core WebRTC complexities:
- STUN/TURN/ICE
- Bandwidth optimization
- Video quality control

Real-time collaboration features

Screen sharing (share screen or window)
Recording for calls/conferences; storage in Voximplant Cloud or S3-compatible storage
Video streaming support (platform capability referenced in docs/features)

Voice/video interoperability

Bridge PSTN/SIP audio into video rooms as part of a unified conferencing model

Messaging

SMS

Send SMS via Management API and receive inbound SMS via HTTP callbacks (for SMS-capable numbers)

Instant Messaging (in-app chat)

Direct messaging between application users
Chat rooms up to 1000 participants
Chatbots for automated interactions

Push notifications (mobile)

Push notifications to wake devices for incoming calls and message notifications
Android push implementation is based on Firebase Cloud Messaging (FCM)

Webhooks / event delivery to your backend

HTTP Callbacks for event-driven notifications without polling the Management API

Tools and Developer Experience

Cloud IDE and debugging

Cloud IDE + debugger in the control panel:
- Code verification
- Autocompletion
- Diff highlighting
- Built-in troubleshooting workflow

SDKs and client libraries

SDKs: iOS, Android, Web, React Native, Flutter, Unity
API clients: curl, Node.js, Python, PHP, Go, .NET, Java

Management API (HTTP)

Control accounts and services programmatically (examples from the docs include managing phone numbers, messaging, billing, logs, records, and user access)

Real-time Media Streaming (WebSockets / Media Streams)

Media Streams: integrate live audio streams into calls via WebSockets for real-time transcription/analysis and AI integrations
WebSocket programming model in VoxEngine:
- Create connections via VoxEngine.createWebSocket(...)
- Stream audio using WebSocket.sendMediaTo(...)
- Recommended audio chunk duration: ~20ms
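
The ~20 ms recommendation translates into a chunk size in bytes once the sample rate and encoding are known. The sketch below works through that arithmetic; chunkBytes and splitIntoChunks are hypothetical helpers, and the formats in the comments are common telephony examples rather than a statement of what a given connector requires.

```javascript
// Bytes in one audio chunk of `chunkMs` milliseconds (mono unless `channels` is set).
//   8 kHz, 1 byte/sample (e.g. mu-law):     160 bytes per 20 ms
//   8 kHz, 2 bytes/sample (16-bit PCM):     320 bytes per 20 ms
//   16 kHz, 2 bytes/sample (wideband PCM):  640 bytes per 20 ms
function chunkBytes(sampleRateHz, bytesPerSample, chunkMs = 20, channels = 1) {
  const samplesPerChunk = Math.round((sampleRateHz * chunkMs) / 1000);
  return samplesPerChunk * bytesPerSample * channels;
}

// Splits raw audio (a Uint8Array or Buffer) into fixed-duration chunks.
// The final chunk may be shorter than the others.
function splitIntoChunks(audio, sampleRateHz, bytesPerSample, chunkMs = 20) {
  const size = chunkBytes(sampleRateHz, bytesPerSample, chunkMs);
  const chunks = [];
  for (let offset = 0; offset < audio.length; offset += size) {
    chunks.push(audio.subarray(offset, offset + size));
  }
  return chunks;
}
```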

Network, Reliability, and Deployment

Serverless runtime (no infra to manage for call logic)
Global footprint: “datacenters in 14 distinct countries” (as stated on the platform page)
Status page for live and historical uptime of subcomponents

Voice AI Integrations (optional, adjacent capability)

If you need speech-to-speech agents and multi-provider orchestration, Voximplant positions Voximplant AI as a serverless runtime for Voice AI pipelines with telephony and real-time connectivity (phone numbers, SIP, mobile, WebRTC, WhatsApp).