---
title: Features and Capabilities
subtitle: Voice AI Orchestration on a Communications Platform
---
# Voximplant Platform Capabilities
Voximplant Platform is an established cloud communications platform for building programmable voice, video, and messaging applications using serverless call control, SDKs, and APIs.
Use this page as a capability map: scan the summary cards first, then expand only the detailed sections you care
about.
## Capabilities at a glance
* **Voice AI Orchestration**: Connect real-time AI agents, speech systems, and telephony channels with code-driven orchestration.
* **Voice telephony**: Run inbound/outbound PSTN, SIP, WebRTC, and WhatsApp voice flows with fine-grained call control.
* **Tools and Developer Experience**: Use cloud IDE/debugging, multi-platform SDKs, and Management API automation.
* **Messaging**: Deliver SMS, in-app messaging, push notifications, and webhook-driven backend integrations.
* **Video telephony**: Build WebRTC video experiences with recording, screen sharing, and voice/video interoperability.
* **Real-time media streaming and deployment**: Stream media over WebSockets and deploy globally on serverless infrastructure.
## Voice AI vendors at a glance
### Direct agent and real-time connectors
* **OpenAI**: Realtime and agent-style voice integrations.
* **Google Gemini**: Live speech interactions with Gemini APIs.
* **Ultravox**: WebSocket-based speech-native connector.
* **Deepgram**: Native voice-agent connector and examples.
* **ElevenLabs**: Conversational AI agent integrations.
* **Cartesia**: Line Agents runtime with VoxEngine orchestration.
* **xAI**: Grok voice-agent flow and feature support.
### Speech and realtime TTS options
* **Cartesia**: Realtime TTS pattern for half-cascade voice pipelines.
  API refs: [Cartesia module](https://voximplant.com/docs/references/voxengine/cartesia).
* **Inworld**: Realtime TTS option for half-cascade voice flows.
* **ElevenLabs**: Streaming/realtime TTS option for voice AI pipelines.
  API refs: [ElevenLabs module](https://voximplant.com/docs/references/voxengine/elevenlabs).
## Detailed capabilities
### Voice AI Orchestration
Voximplant AI is a serverless runtime for **Voice AI pipelines** that connects real-time agent/LLM systems
and speech engines to **PSTN / SIP / WebRTC / mobile / WhatsApp calling**, with code-driven orchestration
and provider flexibility. See [Voximplant AI](https://voximplant.ai/) and the docs [Voice AI connectors
section](/voice-ai-connectors/voice-ai-connectors-overview).
**Supported vendors (direct agent / real-time LLM connectors)**
Native, direct connectors are available for:
* **OpenAI** (Realtime / agent-style integrations) - [Docs:
OpenAI](/voice-ai-connectors/openai/overview)
* **Google Gemini (Live)** - [Docs:
Google](/voice-ai-connectors/gemini/overview)
* **Deepgram Voice Agent** - [Docs:
Deepgram](/voice-ai-connectors/deepgram/overview)
* **ElevenLabs Agents / Conversational AI** - [Docs:
ElevenLabs](/voice-ai-connectors/elevenlabs/overview)
* **Ultravox (WebSocket API)** - [Docs:
Ultravox](/voice-ai-connectors/ultravox/overview)
* **Cartesia Line Agents** - [Docs: Cartesia Line
Agents](/voice-ai-connectors/cartesia/overview)
* **xAI (Grok Voice Agent)** - [Docs:
xAI](/voice-ai-connectors/grok/overview)
In addition to the vendors above, Voximplant AI supports connecting to **any other WebSocket interface**, so
you can integrate other real-time AI systems.
**Supported vendors (speech engines: STT / TTS)**
Voximplant's platform speech layer (STT/TTS) includes built-in providers such as:
* **Speech-to-Text (STT)**: Google Speech Cloud, Microsoft Azure STT, Amazon Transcribe, Yandex Speech Cloud
* **Text-to-Speech (TTS)**: Google Speech Cloud, Amazon Polly, Yandex Speech Cloud, Microsoft Azure TTS,
Tinkoff VoiceKit
For realtime / streaming TTS used in Voice AI scenarios, Voximplant also provides native VoxEngine modules
and guides for:
* **Cartesia Realtime TTS** - [Guide: Realtime
TTS](/voice-ai-connectors/openai/half-cascade-cartesia)
and [API refs](https://voximplant.com/docs/references/voxengine/cartesia)
* **Inworld Realtime TTS** - [Guide: Realtime
TTS](/voice-ai-connectors/openai/half-cascade-inworld)
* **ElevenLabs Streaming / realtime TTS** - [Guide: ElevenLabs
TTS](/voice-ai-connectors/openai/half-cascade-elevenlabs)
and [API refs](https://voximplant.com/docs/references/voxengine/elevenlabs)
**Pipeline options (architectures you can run)**
* **Speech-to-speech**: real-time audio in and real-time audio out (agent API handles full duplex loop)
* **Speech -> LLM -> TTS**: stream audio directly into a speech LLM and use a different TTS for output
* **STT -> LLM -> TTS**: stream audio to STT, pass text to an LLM/toolchain, synthesize response audio
* **Hybrid**: combine a real-time agent API for turn-taking with separate best-of-breed STT/TTS components
(mix and match)
**Orchestration primitives (what you control)**
* **Mix and match providers**: swap STT/TTS/LLM vendors without changing your telephony integration
* **Parallel model execution**: run multiple speech/LLM components in parallel when useful (for example,
intent extraction + generation)
* **Failover paths**: fall back to alternate speech/LLM providers when a step errors or times out
* **Wideband audio**: higher fidelity audio path for improved user experience and model comprehension
* **Deep SIP support**: SIP trunking + registration interop so agents can operate inside PBX/SBC/carrier
environments
* **Channel portability**: reuse the same AI pipeline across PSTN numbers, SIP, WebRTC, mobile SDKs, and
WhatsApp calling
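
The failover primitive above can be sketched in plain JavaScript. This is a minimal illustration, not a platform API: `withTimeout`, `synthesizeWithFailover`, and the provider objects (`name`, `synthesize`) are hypothetical names chosen for the example, and the 2000 ms default timeout is an arbitrary assumption.

```javascript
// Reject if `promise` does not settle within `timeoutMs` milliseconds.
function withTimeout(promise, timeoutMs) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Try each provider in order; move on when one throws or times out.
// `providers` is a hypothetical array of { name, synthesize(text) } objects.
async function synthesizeWithFailover(providers, text, timeoutMs = 2000) {
  const errors = [];
  for (const provider of providers) {
    try {
      const audio = await withTimeout(provider.synthesize(text), timeoutMs);
      return { provider: provider.name, audio };
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`all providers failed (${errors.join("; ")})`);
}
```

Because the telephony leg is untouched by the fallback, the same pattern could wrap STT or LLM steps as well as TTS.
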
**Real-time media integration (streaming)**
* **WebSocket-based media streaming** for connecting calls to real-time AI systems and custom pipelines
(audio + metadata/control messages on the same channel)
* **Media gateway abstraction**: avoid building/operating custom streaming gateways when using native
connectors/modules
### Voice telephony
**Connectivity and endpoints**
* **PSTN calling** (inbound/outbound) via phone numbers and programmable call handling
* **Phone numbers API**: automated procurement in **60+ countries** (availability varies by country)
* **SIP calling and trunking**: connect carriers / PBXs / SBCs using SIP interop (including
registration-based scenarios)
* **WebRTC calling** via web/mobile SDKs (VoIP calling in apps and browsers)
* **WhatsApp calling**: inbound/outbound voice calls via WhatsApp Business API integration
**Serverless call control (VoxEngine)**
* **JavaScript call logic (no XML)** for real-time call routing and application workflows
* **Per-call-leg signaling/media control** - granular control over each leg independently
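
As a rough sketch of per-leg control, the function below bridges an inbound leg to a new PSTN leg. It assumes the `VoxEngine.callPSTN` and `VoxEngine.easyProcess` helpers and the `call` and `destination` fields of the incoming-call event; `bridgeInboundToPstn` and the injected `engine` parameter are hypothetical, added so the logic can be exercised outside the platform. Check the VoxEngine reference before relying on any signature shown here.

```javascript
// Bridge an inbound call to a PSTN destination, keeping a handle on each leg.
// `engine` is expected to be the VoxEngine global when run on the platform.
function bridgeInboundToPstn(engine, event, callerId) {
  const inboundLeg = event.call; // leg 1: the caller
  // leg 2: a new outbound PSTN call to the dialed destination
  const outboundLeg = engine.callPSTN(event.destination, callerId);
  // Assumed helper that connects the two legs and handles teardown.
  engine.easyProcess(inboundLeg, outboundLeg);
  return { inboundLeg, outboundLeg };
}

// Assumed registration inside a scenario (not executed here):
// VoxEngine.addEventListener(AppEvents.CallAlerting, (e) =>
//   bridgeInboundToPstn(VoxEngine, e, "+15550100"));
```

Returning both legs leaves room to attach leg-specific handlers (for example, recording only one side) after the bridge is set up.
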
**Conferencing and bridging**
* **Single conferencing API** for voice/video; **mix PSTN, SIP, WebRTC, and native mobile endpoints**
* **Conferences up to 50 participants**
**Recording, transcription, and speech processing**
* **Call recording** via `call.record()` in scenarios (supports stereo and additional options)
* **Call transcription** via `call.record({ transcribe: true })` and retrieval via `GetCallHistory`
(transcription delivered asynchronously)
* **Speaker/channel labeling** in transcripts (for example, "Left"/"Right" labeling pattern described in
docs)
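
A minimal sketch of the recording options described above, assuming `call.record()` accepts an options object with `stereo`, `transcribe`, and `language` fields. `startTranscribedRecording` is a hypothetical wrapper, and the option names should be verified against the VoxEngine reference.

```javascript
// Start a stereo recording with transcription on one call leg.
// `call` is a VoxEngine Call object on the platform; any object with a
// `record(options)` method works for local experimentation.
function startTranscribedRecording(call, language) {
  // Assumed option names: stereo keeps each party on its own channel,
  // which is what makes "Left"/"Right" labeling possible.
  const options = { stereo: true, transcribe: true };
  if (language) {
    options.language = language; // assumed: recognition language setting
  }
  call.record(options);
  return options;
}
```

The transcript itself is not returned by this call; as noted above, it arrives asynchronously and is retrieved later through the call history.
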
**Speech-to-Text (ASR) modes and features**
* **Phrase-hint mode** (best for constrained dialogs / IVRs) and **Freeform mode** (open transcription)
* **Multiple ASR engines** (for example, Google, Amazon, Microsoft, Yandex, T-Bank) with selectable profiles
* **Intermediate results** support (provider-dependent) for faster partial recognition
* **Google Speech v1p1beta1 feature passthrough** (for example, word time offsets, punctuation, diarization
config)
**Answering machine / voicemail / beep detection**
* **AMD module** for voicemail/answering machine detection in scenarios
* **Beep detection** with specified frequency lists and timeouts (scenario-level control)
* **AMD event/callback model** available in VoxEngine references
**Automated outbound calling (call lists + dialing logic)**
* **Call Lists**: upload a **CSV call list** and process it with VoxEngine scenarios (campaign-style
calling)
* **Management API CallLists**: programmatic call-list upload/append with delimiter support
* **Predictive Dialing System (PDS)**: uses agent/load statistics and call-list progression to place calls
and connect answered calls to agents
* **Predictive and progressive dialing modes** with tunable parameters (for example, allowed failed call
percentage)
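
To illustrate the call-list input, here is a small sketch that serializes rows into a delimiter-separated file. The column names (`phone_number`, `first_name`), the `;` default delimiter, and the quoting rules are assumptions for the example; confirm the expected format in the Call Lists documentation before uploading.

```javascript
// Serialize rows into a delimiter-separated call list with a header line.
// Column names are hypothetical; a scenario would read them per call attempt.
function buildCallListCsv(rows, delimiter = ";") {
  if (rows.length === 0) return "";
  const columns = Object.keys(rows[0]);
  const escapeField = (value) => {
    const text = String(value ?? "");
    // Quote fields that contain the delimiter, quotes, or line breaks.
    return text.includes(delimiter) || /["\r\n]/.test(text)
      ? `"${text.replace(/"/g, '""')}"`
      : text;
  };
  const lines = [columns.join(delimiter)];
  for (const row of rows) {
    lines.push(columns.map((c) => escapeField(row[c])).join(delimiter));
  }
  return lines.join("\n");
}

const csv = buildCallListCsv([
  { phone_number: "+15550100", first_name: "Ada" },
  { phone_number: "+15550101", first_name: "Lin; Jr." },
]);
console.log(csv);
```

The delimiter is a parameter because the Management API accepts a delimiter setting on upload; whichever value you pass there should match the one used to build the file.
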
### Video telephony
**WebRTC video API (server-based + P2P)**
* **Video API** to build server-based and P2P video experiences
* SDKs abstract core WebRTC complexities:
  * **STUN/TURN/ICE**
  * **Bandwidth optimization**
  * **Video quality control**
**Real-time collaboration features**
* **Screen sharing** (share screen or window)
* **Recording** for calls/conferences; storage in Voximplant Cloud or S3-compatible storage
* **Video streaming** support (platform capability referenced in docs/features)
**Voice/video interoperability**
* Bridge **PSTN/SIP audio into video rooms** as part of a unified conferencing model
### Messaging
**SMS**
* **Send SMS via Management API** and **receive inbound SMS via HTTP callbacks** (for SMS-capable numbers)
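
A hedged sketch of what an outbound SMS request could look like. The endpoint path, the `SendSmsMessage` method name, and the parameter names (`account_id`, `api_key`, `source`, `destination`, `sms_body`) are assumptions here and must be checked against the Management API reference; `buildSendSmsUrl` is a hypothetical helper.

```javascript
// Build a Management API request URL for sending an SMS.
// Method and parameter names are assumptions; verify them in the API reference.
function buildSendSmsUrl({ accountId, apiKey, source, destination, body }) {
  const url = new URL(
    "https://api.voximplant.com/platform_api/SendSmsMessage/"
  );
  url.searchParams.set("account_id", String(accountId));
  url.searchParams.set("api_key", apiKey);
  url.searchParams.set("source", source); // SMS-capable number on the account
  url.searchParams.set("destination", destination);
  url.searchParams.set("sms_body", body); // URL-encoded by searchParams
  return url.toString();
}

// Assumed usage (not executed here); keep credentials out of source control:
// const res = await fetch(buildSendSmsUrl({ ... }), { method: "POST" });
```

Inbound messages travel the other way: the platform calls your HTTP callback endpoint, so no polling is involved.
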
**Instant Messaging (in-app chat)**
* **Direct messaging** between application users
* **Chat rooms up to 1000 participants**
* **Chatbots** for automated interactions
**Push notifications (mobile)**
* Push notifications to wake devices for **incoming calls** and **message notifications**
* Android push implementation is based on **Firebase Cloud Messaging (FCM)**
**Webhooks / event delivery to your backend**
* **HTTP Callbacks** for event-driven notifications without polling the Management API
### Tools and Developer Experience
**Cloud IDE and debugging**
* **Cloud IDE + debugger** in the control panel:
  * **Code verification**
  * **Autocompletion**
  * **Diff highlighting**
* Built-in troubleshooting workflow
**SDKs and client libraries**
* SDKs: **iOS, Android, Web, React Native, Flutter, Unity**
* API clients: **curl, Node.js, Python, PHP, Go, .NET, Java**
**Management API (HTTP)**
* Control accounts/services programmatically (examples from docs include managing phone numbers, messaging,
billing, logs, records, and user access)
### Real-time Media Streaming (WebSockets / Media Streams)
* **Media Streams**: integrate **live audio streams** into calls via WebSockets for real-time
transcription/analysis and AI integrations
* WebSocket programming model in VoxEngine:
  * Create connections via `VoxEngine.createWebSocket(...)`
  * Stream audio using `WebSocket.sendMediaTo(...)`
  * Recommended audio chunk duration: **\~20ms**
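
To make the ~20 ms recommendation concrete, the sketch below computes chunk sizes for uncompressed PCM and splits a buffer accordingly. The 16-bit mono PCM format is an assumption for illustration (the actual encoding depends on your stream configuration), and `pcmChunkBytes` and `splitIntoChunks` are hypothetical helpers.

```javascript
// Bytes in one chunk of uncompressed PCM audio.
function pcmChunkBytes({
  sampleRateHz,
  bitsPerSample = 16,
  channels = 1,
  durationMs = 20,
}) {
  const samples = (sampleRateHz * durationMs) / 1000;
  return samples * (bitsPerSample / 8) * channels;
}

// Split a Uint8Array (or Buffer) into fixed-size chunks.
// The final chunk may be shorter than `chunkSize`.
function splitIntoChunks(bytes, chunkSize) {
  const chunks = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    chunks.push(bytes.subarray(offset, offset + chunkSize));
  }
  return chunks;
}

console.log(pcmChunkBytes({ sampleRateHz: 8000 }));  // 320
console.log(pcmChunkBytes({ sampleRateHz: 16000 })); // 640
```

At 8 kHz, a 20 ms chunk is 160 samples, or 320 bytes of 16-bit audio; wideband 16 kHz audio doubles that to 640 bytes. Compressed or 8-bit encodings would yield smaller chunks for the same duration.
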
### Network, Reliability, and Deployment
* **Serverless runtime** (no infrastructure to manage for call logic)
* **Global footprint**: datacenters in **14** distinct countries (as stated on the platform page)
* **Status page** for live and historical uptime of subcomponents