Media over WebSockets
WebSocket is a full-duplex protocol for real-time communication between a client and a third-party service. VoxEngine can create outgoing WebSocket connections, accept incoming WebSocket connections, and exchange text or audio media over either connection type.
Audio over WebSockets is controlled by JSON messages with an event field.
A media stream starts with start, carries audio chunks in media, and ends with stop.
See Media stream format for details on the message schema.
WebSocket Connectivity Overview
Voximplant supports both outgoing and incoming WebSocket connections and can send both text and media data. The interface includes specialized handling for audio media streams for interoperability with calls, conferences, recorders, and other VoxEngine media units.
Outgoing WebSocket connections
Create a WebSocket object with VoxEngine.createWebSocket(...).
Voximplant Cloud performs the WebSocket handshake with the third-party service, then the scenario receives WebSocketEvents.OPEN.
After the connection is open:
- Send strings and stringified data with webSocket.send(...).
- Send call audio to the WebSocket with call.sendMediaTo(webSocket, parameters). Use SendMediaParameters to set encoding, tag, and customParameters. If encoding is omitted, WebSocketAudioEncoding defaults to PCM16_8KHZ.
- Stop media with stopMediaTo(...) or close the connection with webSocket.close().
The minimal setup for an outgoing connection looks like this:
Existing WebSocket connections are not automatically destroyed when a call ends.
Make sure to close outgoing and incoming WebSocket connections with webSocket.close() when the scenario no longer needs them.
A WebSocket can also be closed by Voximplant Cloud or by the third-party service, so handle WebSocketEvents.CLOSE in every flow.
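The outgoing flow described above, including cleanup, might be wired roughly as follows. This is a sketch under stated assumptions, not the reference scenario: inside a real scenario VoxEngine and WebSocketEvents are platform globals, but here they are injected through a hypothetical platform argument so the wiring can run outside Voximplant. The addEventListener calls assume the usual VoxEngine event model, and the tag and customParameters values are illustrative.

```javascript
// Sketch only. In a real scenario VoxEngine and WebSocketEvents are globals;
// they are injected here so the wiring can be exercised outside the platform.
function streamCallToWebSocket(platform, call, url) {
  const { VoxEngine, WebSocketEvents } = platform;
  const webSocket = VoxEngine.createWebSocket(url);
  const state = { open: false };

  webSocket.addEventListener(WebSocketEvents.OPEN, () => {
    state.open = true;
    // encoding is omitted, so the documented default (PCM16_8KHZ) applies.
    // The tag and customParameters values are hypothetical.
    call.sendMediaTo(webSocket, { tag: "caller", customParameters: { source: "demo" } });
  });

  // Voximplant Cloud or the remote service may close the socket at any time.
  webSocket.addEventListener(WebSocketEvents.CLOSE, () => {
    state.open = false;
  });

  // Call this when the scenario no longer needs the connection,
  // for example when the call disconnects.
  function shutdown() {
    if (!state.open) return;
    state.open = false;
    call.stopMediaTo(webSocket);
    webSocket.close();
  }

  return { webSocket, state, shutdown };
}
```

Returning a shutdown function keeps the cleanup step explicit, since the connection is not destroyed automatically when the call ends.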
Incoming WebSocket connections
Enable incoming WebSocket access with VoxEngine.allowWebSocketConnections(), then subscribe to AppEvents.WebSocket. The third-party service connects to the session WebSocket URL, and VoxEngine gives the scenario a websocket object for that connection.
Get the session URL from the StartScenarios Management API response or from AppEvents.Started. Replace https with wss before giving the URL to the external service.
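The scheme replacement can be done with a small helper. The sketch below is illustrative: the function name toWssUrl and the sample URL are hypothetical, and only the scheme is rewritten.

```javascript
// Convert the session URL to the WebSocket scheme expected by the external service.
// Only the scheme is rewritten; host, path, and query are left untouched.
function toWssUrl(sessionUrl) {
  if (!/^https:\/\//i.test(sessionUrl)) {
    throw new Error("Expected an https:// session URL");
  }
  return sessionUrl.replace(/^https:\/\//i, "wss://");
}

// Hypothetical session URL, for illustration only.
console.log(toWssUrl("https://example.invalid/session/abc?token=123"));
```

Rejecting anything that is not https avoids silently handing an unencrypted or malformed URL to the external service.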
The minimal setup for accepting an incoming WebSocket connection looks like this:
A session accepts at most N + 3 incoming WebSocket connections, where N is the number of calls in the session.
If a connection attempt exceeds this limit, VoxEngine triggers AppEvents.NewWebSocketFailed.
Sending text data
Use WebSocket.send(...) to enqueue string data for transmission over the WebSocket connection.
For structured messages, serialize your payload before calling send.
Audio media itself is exchanged as JSON start, media, and stop events; see Media stream format.
Optional echo server for text testing
Working with Audio Media
VoxEngine media streams use JSON messages to describe audio sent over WebSocket connections. Use this format when a custom application receives call audio from Voximplant or sends audio back into a call, conference, recorder, or another media unit.
Media streams are started and stopped from a VoxEngine scenario with methods such as sendMediaTo(...) and stopMediaTo(...).
Each stream starts with a start event, carries audio chunks in media events, and ends with a stop event.
The event field is reserved for the system media-stream events (start, media, stop) and is mandatory for them.
For your own application messages, use a customEvent field instead. Do not combine event and customEvent in the same message.
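A receiver can apply the event and customEvent rule when it parses each text frame. The following sketch is one possible approach; classifyMessage and its return shape are hypothetical names chosen for illustration.

```javascript
const SYSTEM_EVENTS = new Set(["start", "media", "stop"]);

// Classify a raw text frame as a system media event or an application message.
function classifyMessage(raw) {
  const message = JSON.parse(raw);
  if (message === null || typeof message !== "object") {
    return { kind: "other", name: null, message };
  }
  const hasEvent = "event" in message;
  const hasCustom = "customEvent" in message;
  if (hasEvent && hasCustom) {
    throw new Error("event and customEvent must not be combined");
  }
  if (hasEvent) {
    // The event field is reserved for start, media, and stop.
    if (!SYSTEM_EVENTS.has(message.event)) {
      throw new Error(`Unknown system event: ${message.event}`);
    }
    return { kind: "system", name: message.event, message };
  }
  if (hasCustom) {
    return { kind: "custom", name: message.customEvent, message };
  }
  return { kind: "other", name: null, message };
}
```

Failing fast on a message that carries both fields surfaces the mistake early instead of letting the frame be misrouted.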
Sending call audio to a WebSocket
Use call.sendMediaTo(webSocket, parameters) to stream call audio to a WebSocket.
The optional SendMediaParameters object lets you set:
- encoding: a WebSocketAudioEncoding value. The default is PCM16_8KHZ.
- tag: a label for matching start, media, and stop events that belong to the same stream.
- customParameters: application-specific metadata included in the stream start event.
Receiving payload and parameter details
When Voximplant sends audio to your WebSocket service, the stream begins with a StartEvent, continues with MediaInfo events, and finishes with a StopEvent.
If several streams are sent over the same WebSocket at the same time, use the tag field to group chunks by stream: the StartEvent, MediaInfo, and StopEvent messages of one stream all carry the same tag value.
StartEvent
StartEvent is generated by sendMediaTo(...). For example, this sends call audio to a WebSocket:
The WebSocket receives:
You can also set the stream tag, customParameters, and audio encoding from the scenario:
This produces a StartEvent with the chosen tag, media format, and custom parameters:
MediaInfo
MediaInfo events deliver audio chunks. The audio data is stored in media.payload.
This example contains a 20 ms audio chunk encoded as PCM16 at 8000 Hz:
MediaInfo.payload uses the codec specified in StartEvent.mediaFormat.
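To make the sizes concrete, the sketch below builds a media message holding 20 ms of PCM16 silence at 8000 Hz (160 samples, 320 bytes). Base64 encoding of the payload is an assumption not stated in this section, and buildSilentPcm16Media is a hypothetical helper; only the event field and media.payload, chunk, and timestamp come from the text.

```javascript
// Build a media message containing PCM16 mono silence.
// Assumption: the payload is Base64-encoded inside the JSON message.
function buildSilentPcm16Media(chunk, timestamp, durationMs = 20, sampleRate = 8000) {
  const samples = (sampleRate * durationMs) / 1000; // 160 for 20 ms at 8000 Hz
  const pcm = Buffer.alloc(samples * 2); // 2 bytes per PCM16 sample, zero-filled
  return {
    event: "media",
    media: { chunk, timestamp, payload: pcm.toString("base64") },
  };
}

const example = buildSilentPcm16Media(0, 0);
console.log(Buffer.from(example.media.payload, "base64").length); // 320
```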
Audio duration
The audio duration inside MediaInfo.payload depends on the source: a call, player, WebSocket, or another media unit.
For call audio, the chunk length should correspond to the a=ptime and a=maxptime attributes in the call SDP. In most cases, MediaInfo.payload contains about 20 ms of audio.
After decoding the payload into MediaCodec.Codec.PCM16 at the same sample rate, calculate the duration in milliseconds as:
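One way to express this calculation: PCM16 uses 2 bytes per sample, so the duration is the sample count divided by the sample rate. The helper name pcm16DurationMs and the mono default are assumptions made for illustration.

```javascript
// Duration of decoded PCM16 audio in milliseconds.
// Assumption: mono audio unless a channel count is passed.
function pcm16DurationMs(pcmByteLength, sampleRate, channels = 1) {
  const bytesPerSample = 2;
  const samplesPerChannel = pcmByteLength / (bytesPerSample * channels);
  return (samplesPerChannel * 1000) / sampleRate;
}

console.log(pcm16DurationMs(320, 8000)); // 20
```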
chunk and timestamp
MediaInfo.chunk and MediaInfo.timestamp correspond to the sequence number and timestamp fields of the RTP header.
Keep these details in mind:
- Unlike RFC 3550, chunk and timestamp use uint64 width.
- chunk values can be skipped. This means some RTP packets sent over the WebSocket were lost in the network.
- Your application should handle lost chunks, for example by using a Packet Loss Concealment (PLC) mechanism.
- Voximplant uses an adaptive jitter buffer to handle duplicates and reordered packets, but it does not guarantee that every chunk is delivered in the correct order. See RFC 3550 for RTP sequence number and timestamp details.
Calculate the number of lost chunks as:
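A plausible form of this calculation is the gap between consecutive chunk values minus one. The sketch below treats duplicates and reordered chunks as zero loss, which is a simplifying assumption; it also uses plain numbers, so values above 2^53 would need BigInt to preserve the full uint64 range.

```javascript
// Number of chunks missing between two consecutively received MediaInfo events.
// Assumption: duplicates and reordered chunks count as zero loss.
function lostChunks(previousChunk, currentChunk) {
  const gap = currentChunk - previousChunk - 1;
  return gap > 0 ? gap : 0;
}

console.log(lostChunks(10, 14)); // 3
```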
tag
Use MediaInfo.tag when more than one media stream is transmitted over the same WebSocket connection. Your application should demultiplex MediaInfo.payload into separate media streams by grouping chunks with the same tag.
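Demultiplexing can be sketched as a grouping step keyed by tag. The code assumes the tag sits next to the payload inside the media object (as MediaInfo.tag suggests) and that payloads are Base64-encoded; demultiplexByTag is a hypothetical name.

```javascript
// Group decoded audio chunks by stream tag.
// Assumptions: tag lives at media.tag and payload is Base64-encoded.
function demultiplexByTag(mediaMessages) {
  const streams = new Map();
  for (const { media } of mediaMessages) {
    const tag = media.tag ?? ""; // untagged chunks share one default stream
    if (!streams.has(tag)) streams.set(tag, []);
    streams.get(tag).push(Buffer.from(media.payload, "base64"));
  }
  return streams;
}
```

Each map entry can then be concatenated with Buffer.concat or fed to a per-stream decoder.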
StopEvent
StopEvent marks the end of MediaInfo events for a stream. After this event, the next stream starts with a new StartEvent.
StopEvent.tag identifies the stream, and StopEvent.mediaInfo contains statistics for the completed stream:
Attaching WebSocket audio to a call
A WebSocket can also send audio back to Voximplant and route it into a call, conference, recorder, or another media unit.
This example accepts an incoming WebSocket connection and forwards audio from that WebSocket to the active call with websocket.sendMediaTo(inCall):
After the external service receives the wss URL, it can connect and send audio stream events to the call:
Example Node.js audio sender
This example sends a raw 8 kHz mu-law audio file named sample into the WebSocket connection.
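As a partial sketch of such a sender, the helper below only splits the raw file into 20 ms payloads; opening the WebSocket and wrapping the payloads in start, media, and stop events is left out. It relies on mu-law storing one byte per sample, so 8 kHz mono audio is 8 bytes per millisecond. Base64 payload encoding and the name splitMulaw8k are assumptions.

```javascript
// Split raw 8 kHz mono mu-law audio into Base64 payloads of chunkMs each.
// mu-law is one byte per sample, so 8000 Hz is 8 bytes per millisecond.
function splitMulaw8k(audio, chunkMs = 20) {
  const bytesPerChunk = 8 * chunkMs;
  const payloads = [];
  for (let offset = 0; offset < audio.length; offset += bytesPerChunk) {
    // The final chunk may be shorter than chunkMs.
    payloads.push(audio.subarray(offset, offset + bytesPerChunk).toString("base64"));
  }
  return payloads;
}

// Usage sketch:
// const payloads = splitMulaw8k(require("node:fs").readFileSync("sample"));
```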
Convert source audio to raw 8 kHz mu-law mono with FFmpeg:
Sending payload and parameter details
A WebSocket can send audio back to Voximplant and route it to a call, conference, recorder, or another media unit.
This sends audio from a WebSocket into a call:
If you send several streams, assign a unique tag to each one:
Send StartEvent
The first event in the stream must be a StartEvent. It specifies the media stream codec in StartEvent.mediaFormat.
If StartEvent is valid, VoxEngine triggers WebSocketEvents.MEDIA_STARTED in the scenario. The StartEvent.tag, StartEvent.customParameters, and StartEvent.mediaFormat fields are reflected in the event as tag, customParameters, and encoding.
Send MediaInfo
Split the stream into media chunks with these recommendations:
- The duration of each MediaInfo.payload can be arbitrary, but a multiple of 20 ms is recommended.
- You do not need to send chunks in real time. For example, you can send all chunks at once; Voximplant stores them in the WebSocket buffer and forwards them to the target media unit in real time.
- The maximum WebSocket buffer size is 10 seconds. If the limit is reached, extra audio chunks are discarded.
The media chunk must be stored in MediaInfo.payload, and its encoding must match StartEvent.mediaFormat.
Sending faster than real time is useful for file playback and generated audio. Voximplant plays buffered WebSocket media into the target media unit in real time, and clearMediaBuffer(...) can interrupt queued playback.
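When sending faster than real time, a sender can check a burst against the 10-second buffer limit before transmitting. The sketch below is deliberately conservative: it ignores the audio that playback drains while the burst is in flight, and the bufferedMs estimate is something the sender would have to track itself. chunksThatFit is a hypothetical helper.

```javascript
const MAX_BUFFER_MS = 10000; // documented WebSocket buffer limit: 10 seconds

// How many chunks of chunkMs each can be sent at once without being discarded.
// Assumption: bufferedMs is the sender's own estimate of audio still queued.
function chunksThatFit(chunkCount, chunkMs = 20, bufferedMs = 0) {
  const freeMs = Math.max(0, MAX_BUFFER_MS - bufferedMs);
  return Math.min(chunkCount, Math.floor(freeMs / chunkMs));
}

console.log(chunksThatFit(1000)); // 500
```

Chunks beyond the returned count would need to wait until playback has drained part of the buffer.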
Send chunk and timestamp
MediaInfo.chunk and MediaInfo.timestamp correspond to the RTP sequence number and timestamp fields.
If the stream does not contain lost, duplicated, or reordered chunks:
- Increment MediaInfo.chunk by 1 for each next packet.
- Calculate MediaInfo.timestamp as the sum of samples in the previous chunks.
The number of samples in a PCM16 audio chunk is:
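Since PCM16 stores each sample in 2 bytes, the count can be derived from either the byte length or the duration. The helper names below are hypothetical and mono audio is assumed by default.

```javascript
// Samples per channel in a PCM16 buffer (2 bytes per sample).
function pcm16SampleCount(pcmByteLength, channels = 1) {
  return pcmByteLength / (2 * channels);
}

// Samples per channel for a chunk of known duration.
function samplesForDuration(durationMs, sampleRate) {
  return (sampleRate * durationMs) / 1000;
}

console.log(pcm16SampleCount(320), samplesForDuration(20, 8000)); // 160 160
```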
For a normal stream, keep MediaInfo.chunk and MediaInfo.timestamp monotonic.
Send skipped, duplicated, or reordered values only when your application needs to represent packet loss, duplication, or reordering; in that case, both fields should reflect those events accurately.
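For a stream without loss, duplication, or reordering, the two fields can be produced by a small counter. This sketch assumes numbering starts at 0, which the text does not specify; createHeaderCounter is a hypothetical helper.

```javascript
// Produce chunk and timestamp values for consecutive media chunks.
// Assumption: the first chunk is numbered 0 unless another start is given.
function createHeaderCounter(firstChunk = 0) {
  let chunk = firstChunk;
  let timestamp = 0; // sum of samples in all previous chunks
  return function next(sampleCount) {
    const header = { chunk, timestamp };
    chunk += 1;
    timestamp += sampleCount;
    return header;
  };
}

const nextHeader = createHeaderCounter();
console.log(nextHeader(160), nextHeader(160)); // { chunk: 0, timestamp: 0 } { chunk: 1, timestamp: 160 }
```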
Send StopEvent
When all media chunks are sent, send a StopEvent.
If the corresponding StartEvent included a tag, the StopEvent.tag value must match it exactly. A valid StopEvent triggers WebSocketEvents.MEDIA_ENDED in the scenario.
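Before sending, an application could check its own event sequence against these ordering and tag rules. The sketch below assumes that each event nests its fields under a key named after the event (start, media, stop), by analogy with media.payload; that layout and the validateStreams name are assumptions.

```javascript
// Check that every stream starts with start, ends with stop, and uses one tag.
// Assumption: fields are nested under msg.start, msg.media, and msg.stop.
function validateStreams(messages) {
  const open = new Set(); // tags of streams that have started but not stopped
  const errors = [];
  messages.forEach((msg, index) => {
    const body = msg[msg.event] ?? {};
    const tag = body.tag ?? "";
    if (msg.event === "start") {
      if (open.has(tag)) errors.push(`#${index}: stream "${tag}" already started`);
      open.add(tag);
    } else if (msg.event === "media") {
      if (!open.has(tag)) errors.push(`#${index}: media for "${tag}" without start`);
    } else if (msg.event === "stop") {
      // Set.delete returns false when no open stream carries this tag.
      if (!open.delete(tag)) errors.push(`#${index}: stop tag "${tag}" matches no open stream`);
    }
  });
  for (const tag of open) errors.push(`stream "${tag}" has no stop`);
  return errors;
}
```

An empty result means the sequence is consistent with the rules in this section; it does not validate payload contents.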
Changing the media codec
Do not change the codec specified in StartEvent.mediaFormat within the same stream.
To switch codecs, stop the current stream with a StopEvent, then start a new stream with another StartEvent and the new mediaFormat.
The codec declared in StartEvent.mediaFormat applies to every following MediaInfo.payload until the matching StopEvent.
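A codec switch then amounts to emitting a stop for the current stream followed by a start with the new format. In this sketch the nesting of fields under stop and start is assumed, and the mediaFormat value is treated as opaque and supplied by the caller; switchCodecMessages is a hypothetical helper.

```javascript
// Messages needed to switch codecs: close the current stream, then open a new one.
// Assumptions: fields nest under stop/start; newMediaFormat is supplied by the
// caller in whatever shape StartEvent.mediaFormat requires.
function switchCodecMessages(tag, newMediaFormat) {
  return [
    { event: "stop", stop: { tag } },
    { event: "start", start: { tag, mediaFormat: newMediaFormat } },
  ];
}

// Usage sketch: send each message as JSON over the open connection.
// for (const msg of switchCodecMessages("stream-1", format)) socket.send(JSON.stringify(msg));
```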