Media over WebSockets | Voximplant Voice AI

WebSocket is a full-duplex protocol for real-time communication between a client and a third-party service. VoxEngine can create outgoing WebSocket connections, accept incoming WebSocket connections, and exchange text or audio media over either connection type.

Audio over WebSockets is controlled by JSON messages with an event field. A media stream starts with start, carries audio chunks in media, and ends with stop. See Media stream format for details on the message schema.

WebSocket Connectivity Overview

Voximplant supports both outgoing and incoming WebSocket connections and can send both text and media data. The interface includes specialized handling for audio media streams for interoperability with calls, conferences, recorders, and other VoxEngine media units.

Outgoing WebSocket connections

Create a WebSocket object with VoxEngine.createWebSocket(...). Voximplant Cloud performs the WebSocket handshake with the third-party service, then the scenario receives WebSocketEvents.OPEN.

After the connection is open:

Send strings and stringified data with webSocket.send(...).
Send call audio to the WebSocket with call.sendMediaTo(webSocket, parameters). Use SendMediaParameters to set encoding, tag, and customParameters. If encoding is omitted, it defaults to WebSocketAudioEncoding.PCM16_8KHZ.
Stop media with stopMediaTo(...) or close the connection with webSocket.close().

The minimal setup for an outgoing connection looks like this:

VoxEngine.addEventListener(AppEvents.CallAlerting, (event) => {
  const webSocket = VoxEngine.createWebSocket("wss://your-url.example.com");
  // Handle WebSocketEvents.OPEN, MESSAGE, ERROR, and CLOSE.
});

Existing WebSocket connections are not automatically destroyed when a call ends. Make sure to close outgoing and incoming WebSocket connections with webSocket.close() when the scenario no longer needs them. A WebSocket can also be closed by Voximplant Cloud or by the third-party service, so handle WebSocketEvents.CLOSE in every flow.

Incoming WebSocket connections

Enable incoming WebSocket access with VoxEngine.allowWebSocketConnections(), then subscribe to AppEvents.WebSocket. The third-party service connects to the session WebSocket URL, and VoxEngine gives the scenario a websocket object for that connection.

Get the session URL from the StartScenarios Management API response or from AppEvents.Started. Replace https with wss before giving the URL to the external service.
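
The https-to-wss substitution can be sketched as a small helper. The function name toWebSocketUrl and the sample URL in the note below are illustrative assumptions, not part of the VoxEngine API. Anchoring the replacement to the scheme avoids touching an "https" substring elsewhere in the URL.

```javascript
// Hypothetical helper: convert the session's HTTPS access URL into a WSS URL.
function toWebSocketUrl(accessSecureUrl) {
  if (!/^https:\/\//i.test(accessSecureUrl)) {
    throw new Error(`Expected an https:// URL, got: ${accessSecureUrl}`);
  }
  // Replace only the leading scheme, never a later "https" substring.
  return accessSecureUrl.replace(/^https:/i, "wss:");
}
```

For example, toWebSocketUrl("https://example.com/session/abc") returns "wss://example.com/session/abc".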

The minimal setup for accepting an incoming WebSocket connection looks like this:

VoxEngine.allowWebSocketConnections();

VoxEngine.addEventListener(AppEvents.WebSocket, (event) => {
  const webSocket = event.websocket;
  // Handle the incoming WebSocket connection here.
});

A session accepts at most N + 3 incoming WebSocket connections, where N is the number of calls in the session. If another connection arrives after the limit is reached, VoxEngine triggers AppEvents.NewWebSocketFailed.

Sending text data

Use WebSocket.send(...) to enqueue string data for transmission over the WebSocket connection. For structured messages, serialize your payload before calling send. Audio media itself is exchanged as JSON start, media, and stop events; see Media stream format.

VoxEngine.addEventListener(AppEvents.CallAlerting, ({ call }) => {
  call.answer();
  call.addEventListener(CallEvents.Disconnected, VoxEngine.terminate);

  const webSocket = VoxEngine.createWebSocket("wss://your-url.example.com");

  webSocket.addEventListener(WebSocketEvents.OPEN, () => {
    Logger.write("Sending message");
    webSocket.send("Some test message");
  });

  webSocket.addEventListener(WebSocketEvents.MESSAGE, (message) => {
    Logger.write(`Received message ${message.text}`);
  });

  webSocket.addEventListener(WebSocketEvents.CLOSE, () => {
    VoxEngine.terminate();
  });
});

Optional echo server for text testing

import asyncio
import websockets

async def echo(websocket):
    async for message in websocket:
        await websocket.send(message)

async def main():
    async with websockets.serve(echo, "localhost", 8765):
        await asyncio.Future()

asyncio.run(main())

Working with Audio Media

VoxEngine media streams use JSON messages to describe audio sent over WebSocket connections. Use this format when a custom application receives call audio from Voximplant or sends audio back into a call, conference, recorder, or another media unit.

Media streams are started and stopped from a VoxEngine scenario with methods such as sendMediaTo(...) and stopMediaTo(...). Each stream starts with a start event, carries audio chunks in media events, and ends with a stop event.

The event field is reserved for the system media-stream events (start, media, stop) and is mandatory for them. For your own application messages, use a customEvent field instead. Do not combine event and customEvent in the same message.
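
A minimal sketch of how an application message could be serialized so that it never collides with the reserved event field. The helper name buildCustomMessage and the example field names are assumptions made for illustration; only the customEvent field itself comes from the text above.

```javascript
// Hypothetical helper: serialize an application-level message for webSocket.send(...).
// It sets customEvent and refuses payloads that also carry the reserved "event" field.
function buildCustomMessage(customEvent, data = {}) {
  if (typeof customEvent !== "string" || customEvent.length === 0) {
    throw new TypeError("customEvent must be a non-empty string");
  }
  if (Object.prototype.hasOwnProperty.call(data, "event")) {
    throw new Error('"event" is reserved for start, media, and stop messages');
  }
  return JSON.stringify({ ...data, customEvent });
}
```

The returned string can be passed directly to send(...), for example buildCustomMessage("transcript", { text: "hello" }).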

Sending call audio to a WebSocket

Use call.sendMediaTo(webSocket, parameters) to stream call audio to a WebSocket. The optional SendMediaParameters object lets you set:

encoding: a WebSocketAudioEncoding value. The default is PCM16_8KHZ.
tag: a label for matching start, media, and stop events that belong to the same stream.
customParameters: application-specific metadata included in the stream start event.

VoxEngine.addEventListener(AppEvents.CallAlerting, ({ call }) => {
  call.answer();
  call.addEventListener(CallEvents.Disconnected, VoxEngine.terminate);

  const webSocket = VoxEngine.createWebSocket("wss://your-url.example.com");

  webSocket.addEventListener(WebSocketEvents.OPEN, () => {
    call.sendMediaTo(webSocket, {
      encoding: WebSocketAudioEncoding.ALAW,
      tag: "call",
      customParameters: { source: "inbound-call" },
    });
  });

  webSocket.addEventListener(WebSocketEvents.MESSAGE, (message) => {
    Logger.write(JSON.stringify(message));
  });

  webSocket.addEventListener(WebSocketEvents.CLOSE, VoxEngine.terminate);
});

Receiving payload and parameter details

When Voximplant sends audio to your WebSocket service, the stream begins with a StartEvent, continues with MediaInfo events, and finishes with a StopEvent.

If a WebSocket carries several streams at the same time, each stream uses the same tag value in its StartEvent, MediaInfo, and StopEvent. Use that tag to group chunks by stream.

StartEvent

StartEvent is generated by sendMediaTo(...). For example, this sends call audio to a WebSocket:

call.sendMediaTo(webSocket);

The WebSocket receives:

{
  event: "start",
  sequenceNumber: 0,
  start: {
    mediaFormat: {
      encoding: "PCM16",
      sampleRate: 8000,
    },
  },
}

You can also set the stream tag, customParameters, and audio encoding from the scenario:

call.sendMediaTo(webSocket, {
  tag: "stream2",
  encoding: WebSocketAudioEncoding.PCM16_8KHZ,
  customParameters: { test: "123" },
});

This produces a StartEvent with the chosen tag, media format, and custom parameters:

{
  event: "start",
  sequenceNumber: 0,
  start: {
    tag: "stream2",
    mediaFormat: {
      encoding: "PCM16",
      sampleRate: 8000,
    },
    customParameters: "{\"test\":\"123\"}",
  },
}

MediaInfo

MediaInfo events deliver audio chunks. The audio data is stored in media.payload.

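This example shows a media event for a stream encoded as PCM16 at 8000 Hz. A full 20 ms chunk in this format holds 320 bytes of audio, so the payload shown here is abbreviated: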

{
  event: "media",
  sequenceNumber: 4,
  media: {
    timestamp: 18880,
    chunk: 3,
    payload: "AAAAAAAAAAAAAAAAAAAAAA==",
  },
}

MediaInfo.payload uses the codec specified in StartEvent.mediaFormat.

Audio duration

The audio duration inside MediaInfo.payload depends on the source: a call, player, WebSocket, or another media unit.

For call audio, the chunk length should correspond to the a=ptime and a=maxptime attributes in the call SDP. In most cases, MediaInfo.payload contains about 20 ms of audio.

After decoding the payload into MediaCodec.Codec.PCM16 at the same sample rate, calculate the duration in milliseconds as:

sizeInBytes(payloadInPcm16) / 2 * 1000 / StartEvent.mediaFormat.sampleRate
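
The formula can be sketched in Node.js as follows. The function name pcm16DurationMs is an illustrative assumption, and the sketch assumes the payload is a base64 string that already holds PCM16 samples (2 bytes per sample).

```javascript
// Hypothetical helper: duration in milliseconds of a base64-encoded PCM16 payload.
// Assumes the payload is already PCM16, that is, 2 bytes per sample.
function pcm16DurationMs(payloadBase64, sampleRate) {
  const sizeInBytes = Buffer.from(payloadBase64, "base64").length;
  const samples = sizeInBytes / 2;
  return (samples * 1000) / sampleRate;
}
```

A 320-byte payload at a sample rate of 8000 Hz yields 20 ms, which matches the typical chunk length for call audio.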

chunk and timestamp

MediaInfo.chunk and MediaInfo.timestamp correspond to the sequence number and timestamp fields of the RTP header.

Keep these details in mind:

Unlike RFC 3550, chunk and timestamp use uint64 width.
chunk values can be skipped. This means some RTP packets sent over the WebSocket were lost in the network.
Your application should handle lost chunks, for example by using a Packet Loss Concealment (PLC) mechanism.
Voximplant uses an adaptive jitter buffer to handle duplicates and reordered packets, but it does not guarantee that every chunk is delivered in the correct order. See RFC 3550 for RTP sequence number and timestamp details.

Calculate the number of lost chunks as:

Current.MediaInfo.chunk - LastReceived.MediaInfo.chunk - 1
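
As a rough sketch, a receiver could keep a running loss count by applying this formula to each forward jump in chunk. The name createLossTracker is hypothetical. The sketch ignores duplicates and late arrivals, so a chunk that shows up after it was counted as lost is not subtracted again. It also relies on JavaScript numbers, which are exact only up to 2^53 - 1, while chunk is described as uint64.

```javascript
// Hypothetical tracker: counts chunks missing between consecutive MediaInfo.chunk values.
function createLossTracker() {
  let lastChunk = null; // highest chunk value seen so far
  let lost = 0;
  return {
    // Feed each received MediaInfo.chunk value; returns the running loss count.
    observe(chunk) {
      if (lastChunk === null) {
        lastChunk = chunk;
      } else if (chunk > lastChunk) {
        lost += chunk - lastChunk - 1; // Current - LastReceived - 1
        lastChunk = chunk;
      }
      // Duplicates and reordered chunks (chunk <= lastChunk) leave the count unchanged.
      return lost;
    },
  };
}
```

A non-zero count is a signal to apply whatever concealment strategy the application uses for the missing audio.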

tag

Use MediaInfo.tag when more than one media stream is transmitted over the same WebSocket connection. Your application should demultiplex MediaInfo.payload into separate media streams by grouping chunks with the same tag.
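
One way to demultiplex streams on the receiving side is to keep a map keyed by tag. This is a sketch under assumptions: createDemultiplexer and tagOf are hypothetical names, payloads are assumed to be base64 strings, and because the examples in this document show tag both inside the event body and at the top level, the sketch checks both places.

```javascript
// Hypothetical helper: read the tag from the event body (start/media/stop) or the top level.
function tagOf(message) {
  const body = message[message.event] || {};
  return body.tag ?? message.tag ?? ""; // untagged streams share the "" key
}

// Hypothetical demultiplexer: groups decoded audio chunks by stream tag.
function createDemultiplexer() {
  const streams = new Map();
  return {
    streams,
    handle(rawMessage) {
      const message = JSON.parse(rawMessage);
      const tag = tagOf(message);
      if (message.event === "start") {
        streams.set(tag, {
          mediaFormat: message.start.mediaFormat,
          chunks: [],
          ended: false,
        });
        return;
      }
      const stream = streams.get(tag);
      if (!stream) return; // media or stop for a stream that never started
      if (message.event === "media") {
        stream.chunks.push(Buffer.from(message.media.payload, "base64"));
      } else if (message.event === "stop") {
        stream.ended = true;
      }
    },
  };
}
```

Each incoming WebSocket text frame is passed to handle(...), and streams.get(tag).chunks then holds the audio for that stream in arrival order.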

StopEvent

StopEvent marks the end of MediaInfo events for a stream. After this event, the next stream starts with a new StartEvent.

StopEvent.tag identifies the stream, and StopEvent.mediaInfo contains statistics for the completed stream:

{
  event: "stop",
  tag: "stream2",
  sequenceNumber: 1009,
  stop: {
    mediaInfo: {
      bytesSent: 645120,
      duration: 340800,
    },
  },
}

Attaching WebSocket audio to a call

A WebSocket can also send audio back to Voximplant and route it into a call, conference, recorder, or another media unit. This example accepts an incoming WebSocket connection and forwards audio from that WebSocket to the active call with websocket.sendMediaTo(inCall):

let inCall;
let webSocketUrl;

VoxEngine.addEventListener(AppEvents.Started, ({ accessSecureURL }) => {
  webSocketUrl = accessSecureURL.replace("https", "wss");
  VoxEngine.allowWebSocketConnections();
});

VoxEngine.addEventListener(AppEvents.CallAlerting, ({ call }) => {
  inCall = call;
  call.answer();

  call.sendMessage(`use websocket url ${webSocketUrl}`);

  call.addEventListener(CallEvents.Disconnected, VoxEngine.terminate);
  call.addEventListener(CallEvents.Failed, VoxEngine.terminate);
});

VoxEngine.addEventListener(AppEvents.WebSocket, ({ websocket }) => {
  websocket.addEventListener(WebSocketEvents.ERROR, () => {
    Logger.write("Incoming WebSocket error");
  });

  websocket.addEventListener(WebSocketEvents.CLOSE, ({ reason }) => {
    Logger.write(`Incoming WebSocket closed: ${reason}`);
  });

  websocket.addEventListener(WebSocketEvents.MESSAGE, ({ text }) => {
    Logger.write(`Incoming WebSocket message: ${text}`);
  });

  websocket.sendMediaTo(inCall);
});

After the external service receives the wss URL, it can connect and send audio stream events to the call:

node server-code.js wss://example-link

Example Node.js audio sender

This example sends a raw 8 kHz mu-law audio file named sample into the WebSocket connection.

const fs = require("node:fs");
const WebSocketClient = require("websocket").client;

if (process.argv.length !== 3) {
  console.error("Usage: node server-code.js <wss-url>");
  process.exit(1);
}

const wsUrl = process.argv[2];
const client = new WebSocketClient();

let startTime = null;

client.on("connectFailed", (error) => {
  console.log(`Connect error: ${error.toString()}`);
  process.exit(1);
});

client.on("connect", (connection) => {
  console.log("Connection established");

  connection.on("error", (error) => {
    console.log(`Connection error: ${error.toString()}`);
    process.exit(1);
  });

  connection.on("close", () => {
    console.log("Connection closed");
    process.exit(0);
  });

  // The stream must begin with a start event that declares the codec.
  const startEvent = {
    event: "start",
    sequenceNumber: 0,
    start: {
      mediaFormat: {
        encoding: "ULAW",
        sampleRate: 8000,
      },
    },
  };

  connection.send(JSON.stringify(startEvent));

  const CHUNK_DURATION_MS = 20;
  // 20 ms of 8 kHz mu-law audio: 160 samples at 1 byte per sample.
  const CHUNK_SIZE = 160;
  const buffer = Buffer.alloc(CHUNK_SIZE);
  let sequenceNumber = 1;
  let mediaChunkIndex = 0;

  fs.open("sample", "r", (openError, fd) => {
    if (openError) {
      throw openError;
    }

    function readNextChunk() {
      fs.read(fd, buffer, 0, CHUNK_SIZE, null, (readError, bytesRead) => {
        if (readError) {
          throw readError;
        }

        if (bytesRead < CHUNK_SIZE) {
          // End of file. A trailing partial chunk is dropped.
          const stopEvent = {
            event: "stop",
            sequenceNumber,
            stop: {},
          };

          connection.send(JSON.stringify(stopEvent));
          fs.close(fd, () => {});
          // Close the connection instead of exiting immediately so the queued
          // stop event can be sent; the "close" handler above exits the process.
          connection.close();
          return;
        }

        const mediaEvent = {
          event: "media",
          sequenceNumber,
          media: {
            chunk: mediaChunkIndex,
            payload: buffer.toString("base64"),
            // For mu-law, one byte is one sample, so this is the sample offset.
            timestamp: mediaChunkIndex * CHUNK_SIZE,
          },
        };

        connection.send(JSON.stringify(mediaEvent));
        sequenceNumber += 1;
        mediaChunkIndex += 1;

        // Pace the stream against the wall clock to avoid timer drift.
        const now = Date.now();
        if (startTime === null) {
          startTime = now;
        }

        const nextChunkTime = startTime + CHUNK_DURATION_MS * mediaChunkIndex;
        setTimeout(readNextChunk, Math.max(nextChunkTime - now, 0));
      });
    }

    readNextChunk();
  });
});

console.log(`Connect to ${wsUrl}`);
client.connect(wsUrl);

Convert source audio to raw 8 kHz mu-law mono with FFmpeg:

ffmpeg -i ./record.mp3 -f mulaw -acodec pcm_mulaw -ac 1 -ar 8000 sample

Sending payload and parameter details

A WebSocket can send audio back to Voximplant and route it to a call, conference, recorder, or another media unit.

This sends audio from a WebSocket into a call:

webSocket.sendMediaTo(call);

If you send several streams, assign a unique tag to each one:

webSocket.sendMediaTo(call, {
  tag: "stream1",
});

webSocket.sendMediaTo(recorder, {
  tag: "stream2",
});

Send StartEvent

The first event in the stream must be a StartEvent. It specifies the media stream codec in StartEvent.mediaFormat.

If StartEvent is valid, VoxEngine triggers WebSocketEvents.MEDIA_STARTED in the scenario. The StartEvent.tag, StartEvent.customParameters, and StartEvent.mediaFormat fields are reflected in the event as tag, customParameters, and encoding.
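
A sketch of how an external service might assemble a StartEvent before sending it. The builder name buildStartEvent is hypothetical. The layout mirrors the StartEvent that Voximplant itself emits in the examples above, including customParameters as a JSON string; whether events sent to Voximplant need that exact string form is an assumption to verify against the API reference.

```javascript
// Hypothetical builder: assemble a StartEvent object for an outbound media stream.
// Field layout follows the StartEvent examples shown earlier in this document.
function buildStartEvent({ encoding, sampleRate, tag, customParameters } = {}) {
  if (!encoding || !sampleRate) {
    throw new Error("encoding and sampleRate are required");
  }
  const start = { mediaFormat: { encoding, sampleRate } };
  if (tag !== undefined) {
    start.tag = tag;
  }
  if (customParameters !== undefined) {
    // Assumption: custom parameters travel as a JSON string, as in the received example.
    start.customParameters = JSON.stringify(customParameters);
  }
  return { event: "start", sequenceNumber: 0, start };
}
```

The result is serialized with JSON.stringify(...) and sent as the first message of the stream.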

Send MediaInfo

Split the stream into media chunks with these recommendations:

The duration of each MediaInfo.payload can be arbitrary, but a multiple of 20 ms is recommended.
You do not need to send chunks in real time. For example, you can send all chunks at once; Voximplant stores them in the WebSocket buffer and forwards them to the target media unit in real time.
The maximum WebSocket buffer size is 10 seconds. If the limit is reached, extra audio chunks are discarded.

The media chunk must be stored in MediaInfo.payload, and its encoding must match StartEvent.mediaFormat.

Sending faster than real time is useful for file playback and generated audio. To interrupt playback that is already queued in the buffer, call clearMediaBuffer(...).
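
The chunking recommendations can be sketched as two helpers. The names splitIntoChunks and fitsInBuffer are hypothetical. The burst check is deliberately conservative: it treats all chunks as arriving at once and ignores audio that drains from the buffer while the burst is being sent.

```javascript
// Limit taken from the 10-second WebSocket buffer size stated above.
const MAX_BUFFER_MS = 10000;

// Hypothetical helper: split raw audio into fixed-size chunks.
// The last chunk may be shorter than bytesPerChunk.
function splitIntoChunks(audio, bytesPerChunk) {
  const chunks = [];
  for (let offset = 0; offset < audio.length; offset += bytesPerChunk) {
    chunks.push(audio.subarray(offset, offset + bytesPerChunk));
  }
  return chunks;
}

// Hypothetical check: can this many chunks be sent in one burst without overflow?
function fitsInBuffer(chunkCount, chunkDurationMs) {
  return chunkCount * chunkDurationMs <= MAX_BUFFER_MS;
}
```

For 8 kHz mu-law audio, 20 ms corresponds to 160 bytes per chunk, so 500 such chunks exactly fill the buffer. Longer audio needs pacing or batching on the sender side.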

Send chunk and timestamp

MediaInfo.chunk and MediaInfo.timestamp correspond to the RTP sequence number and timestamp fields.

If the stream does not contain lost, duplicated, or reordered chunks:

Increment MediaInfo.chunk by 1 for each next packet.
Calculate MediaInfo.timestamp as the sum of samples in the previous chunks.

The number of samples in a PCM16 audio chunk is:

sizeInBytes(payloadInPcm16) / 2

If the stream includes lost, duplicated, or reordered chunks, MediaInfo.chunk and MediaInfo.timestamp should accurately reflect those losses, duplications, and reorderings.

For a normal stream, keep chunk and timestamp monotonic. Only send skipped, duplicated, or reordered values when your application needs to represent packet loss, duplication, or reordering.
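
For a stream without loss, duplication, or reordering, the counters can be maintained by a small factory like the one below. The name createMediaEventFactory is hypothetical. The sketch assumes payloads are Node.js Buffers with a fixed number of bytes per sample (2 for PCM16, 1 for mu-law or A-law), and that sequenceNumber 0 was already used by the StartEvent.

```javascript
// Hypothetical factory: produces media events with monotonic chunk and timestamp values.
function createMediaEventFactory(bytesPerSample) {
  let sequenceNumber = 1; // assumption: 0 was used by the StartEvent
  let chunk = 0;
  let timestamp = 0; // total samples contained in all previous chunks
  return function nextMediaEvent(payload) {
    const event = {
      event: "media",
      sequenceNumber,
      media: { chunk, timestamp, payload: payload.toString("base64") },
    };
    sequenceNumber += 1;
    chunk += 1;
    timestamp += payload.length / bytesPerSample;
    return event;
  };
}
```

With 320-byte PCM16 payloads, successive events carry timestamp values 0, 160, 320, and so on, while chunk increases by 1 each time.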

Send StopEvent

When all media chunks are sent, send a StopEvent.

If the corresponding StartEvent included a tag, the StopEvent.tag value must match it exactly. A valid StopEvent triggers WebSocketEvents.MEDIA_ENDED in the scenario.
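
A sketch of a StopEvent builder. The name buildStopEvent is hypothetical. The empty stop object follows the Node.js sender example above, and the top-level tag placement follows the received StopEvent example earlier in this document; treat that placement as an assumption to confirm for events sent to Voximplant.

```javascript
// Hypothetical builder: assemble a StopEvent that closes a stream.
// Pass the same tag that was used in the StartEvent, or omit it for untagged streams.
function buildStopEvent(sequenceNumber, tag) {
  const event = { event: "stop", sequenceNumber, stop: {} };
  if (tag !== undefined) {
    event.tag = tag; // assumption: top-level tag, as in the received StopEvent example
  }
  return event;
}
```

The sequenceNumber passed in should be the next value after the last media event.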

Changing the media codec

Do not change the codec specified in StartEvent.mediaFormat within the same stream.

To switch codecs, stop the current stream with a StopEvent, then start a new stream with another StartEvent and the new mediaFormat.

The codec declared in StartEvent.mediaFormat applies to every following MediaInfo.payload until the matching StopEvent.
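
The ordering rule can be sketched as a small state machine that a sender could run over its own outgoing events before transmitting them. The name createStreamValidator is hypothetical, and the sketch covers a single untagged stream only.

```javascript
// Hypothetical validator: enforces start -> media* -> stop ordering for one stream,
// so a codec change is only possible by stopping and starting again.
function createStreamValidator() {
  let activeFormat = null; // mediaFormat of the stream currently in progress
  return {
    // Returns the media format that applies after the message, or null after stop.
    accept(message) {
      switch (message.event) {
        case "start":
          if (activeFormat) {
            throw new Error("start sent before the previous stream was stopped");
          }
          activeFormat = message.start.mediaFormat;
          return activeFormat;
        case "media":
          if (!activeFormat) {
            throw new Error("media sent before start");
          }
          return activeFormat;
        case "stop":
          if (!activeFormat) {
            throw new Error("stop sent without an active stream");
          }
          activeFormat = null;
          return null;
        default:
          throw new Error(`unknown event: ${message.event}`);
      }
    },
  };
}
```

Switching from mu-law to A-law then becomes the sequence stop, start with the new mediaFormat, media; a second start without a preceding stop is rejected.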