Voice Activity Detection

Benefits

Voice Activity Detection (VAD) gives your VoxEngine scenario a fast signal for when a caller starts and stops speaking. In practice, that is what powers barge-in, pause detection, and the first layer of turn-taking.

Capability highlights:

Powered by Voximplant’s native Silero module.
Works directly on call media inside VoxEngine without a custom media gateway.
Emits speech-boundary events you can use to stop TTS playback and trigger downstream turn detection.
Pairs naturally with STT and Pipecat Smart Turn in full-cascade voice pipelines.

Architecture

What Voximplant exposes

Load the module and create a VAD instance:

require(Modules.Silero);

const vad = await Silero.createVAD({
  threshold: 0.5,
  minSilenceDurationMs: 300,
  speechPadMs: 10,
});

call.sendMediaTo(vad);

The module surface is intentionally small:

Silero.createVAD(parameters)
vad.addEventListener(...)
vad.reset()
vad.close()
vad.id() and vad.webSocketId() for diagnostics
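
The surface above implies a simple lifecycle: create the instance, listen for events, optionally reset between turns, and close when the call ends. As a minimal sketch, assuming only that `vad.close()` exists as listed, the helper below makes teardown safe to trigger from more than one place. `createVadCloser` is a hypothetical name, not part of the module, and because this page does not say whether `close()` returns a promise, the sketch does not await it.

```javascript
// Hypothetical helper (not part of the Silero module): wraps a VAD instance
// so that close() runs at most once, even if several call events
// (for example disconnect and failure) both trigger teardown.
function createVadCloser(vad) {
  let closed = false;
  return function closeOnce() {
    if (closed) return false; // already torn down, nothing to do
    closed = true;
    vad.close(); // the only module call used here, as listed above
    return true;
  };
}
```

In a scenario, the returned function would typically be called from your call-teardown handlers, such as a `CallEvents.Disconnected` listener.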

VAD parameters

These are the parameters Voximplant currently exposes on Silero.createVAD().

Parameter	Type	Default	What it controls
`threshold`	`number`	`0.5`	Speech probability threshold above which Voximplant treats the segment as speech. Raise it to reduce false positives; lower it to react earlier to softer speech.
`minSilenceDurationMs`	`number`	`300`	Silence duration required before Voximplant emits `speechEndAt` for a segment. Raise it if short pauses are splitting turns too aggressively.
`speechPadMs`	`number`	`0`	Padding added around detected speech segments so the boundaries are less aggressive. Useful when speech starts or ends are getting clipped.
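
To build intuition for how `threshold` and `minSilenceDurationMs` interact, here is a deliberately simplified segmenter that runs over per-frame speech probabilities. It is an illustration only and not Silero's or Voximplant's internal algorithm: `segmentSpeech`, the `frameMs` option, and the input probabilities are all assumptions made for the example, and `speechPadMs` is left out to keep the sketch short.

```javascript
// Illustrative only: a simplified threshold + minimum-silence segmenter.
// NOT the module's real implementation; frameMs and the probabilities
// passed in are made-up inputs for demonstration.
function segmentSpeech(probs, { threshold = 0.5, minSilenceDurationMs = 300, frameMs = 100 } = {}) {
  const segments = [];
  let startMs = null;      // start of the open segment, or null if none is open
  let lastSpeechEndMs = 0; // end time of the most recent frame classified as speech

  probs.forEach((p, i) => {
    const frameStart = i * frameMs;
    const frameEnd = frameStart + frameMs;
    if (p > threshold) {
      // Speech frame: open a segment if needed and extend its end.
      if (startMs === null) startMs = frameStart;
      lastSpeechEndMs = frameEnd;
    } else if (startMs !== null && frameEnd - lastSpeechEndMs >= minSilenceDurationMs) {
      // Enough continuous silence has accumulated: close the segment.
      segments.push({ startMs, endMs: lastSpeechEndMs });
      startMs = null;
    }
  });

  // Close a segment that is still open when the input ends.
  if (startMs !== null) segments.push({ startMs, endMs: lastSpeechEndMs });
  return segments;
}

// Two bursts of speech separated by a 200 ms pause, then a longer silence.
const probs = [0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1];
const oneTurn = segmentSpeech(probs, { minSilenceDurationMs: 300 });
const splitTurns = segmentSpeech(probs, { minSilenceDurationMs: 200 });
```

With these inputs, the 300 ms setting bridges the short pause and yields one segment, while the 200 ms setting splits the same audio into two. That is the effect to keep in mind when short pauses are splitting turns too aggressively.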

VAD events

These are the core events exposed by the Silero module.

Event	Payload	How to use it
`Silero.VADEvents.Result`	`speechStartAt?`, `speechEndAt?`	Main speech-boundary signal. In Voximplant typings these timestamps are documented in seconds. Use `speechStartAt` for barge-in and `speechEndAt` to start pause-based logic.
`Silero.VADEvents.Reset`	none	Emitted after `vad.reset()`. Useful when you need to clear VAD state between turns or after call state changes.
`Silero.VADEvents.Error`	`reason`	Connector/runtime error path. Log it and fail fast or fall back cleanly.
`Silero.VADEvents.ConnectorInformation`	`data`	Connector metadata and diagnostics. Useful for confirming the loaded connector version and endpoint.
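
One way to wire all four events in a single place is sketched below. It relies only on the event names and payload fields listed in the table; `attachVadListeners` and its callback names are hypothetical, and the VAD instance and event constants are passed in as arguments so the helper makes no further assumptions about the module.

```javascript
// Hypothetical wiring helper. `vad` and `VADEvents` are injected so the
// function depends only on the event names and payload fields in the table.
function attachVadListeners(vad, VADEvents, { onSpeechStart, onSpeechEnd, onError, log = () => {} }) {
  vad.addEventListener(VADEvents.Result, (event) => {
    // Compare against null/undefined so a timestamp of 0 still counts.
    if (event.speechStartAt != null) onSpeechStart(event.speechStartAt);
    if (event.speechEndAt != null) onSpeechEnd(event.speechEndAt);
  });

  vad.addEventListener(VADEvents.Reset, () => {
    log("VAD state reset");
  });

  vad.addEventListener(VADEvents.Error, (event) => {
    // Log first, then let the caller decide whether to fail fast or fall back.
    log("VAD error: " + event.reason);
    onError(event.reason);
  });

  vad.addEventListener(VADEvents.ConnectorInformation, (event) => {
    log("VAD connector info: " + JSON.stringify(event.data));
  });
}
```

In a scenario this would be called as `attachVadListeners(vad, Silero.VADEvents, { ... })`, with `log` pointing at whatever logging function your scenario uses, for example `Logger.write`.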

Development notes

Use VAD for speech boundaries, not transcripts. Silero tells you that speech is happening; STT still owns the actual words.
Barge-in usually starts with speechStartAt. In a phone assistant flow, that is the right moment to clear any queued agent audio.
Pause sensitivity is mostly minSilenceDurationMs. If callers pause naturally mid-sentence, this is usually the first value to tune.
speechPadMs is a safety margin. It helps avoid clipped starts and ends, but too much padding can make boundaries feel slower.
VAD and turn detection are complementary. A common pattern is: Silero detects speech boundaries, then Pipecat Smart Turn evaluates whether the turn is actually complete.
This matters most in cascade architectures. When STT, LLM, and TTS come from different vendors, VAD is usually the first signal that keeps interruptions responsive and prevents the agent from talking over the caller.
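
The notes above can be combined into a small coordinator that tracks whether the caller is speaking, triggers barge-in on speech start, and hands off to turn detection on speech end. This is a sketch under stated assumptions: `createTurnCoordinator`, `clearAgentAudio`, and `checkTurnComplete` are hypothetical names standing in for your own TTS and turn-detection integration, and the duration calculation assumes the timestamps are in seconds, as the typings describe.

```javascript
// Hypothetical coordinator: the two callbacks are placeholders for your own
// TTS player and turn detector, not Voximplant or Pipecat APIs.
function createTurnCoordinator({ clearAgentAudio, checkTurnComplete }) {
  let callerSpeaking = false;
  let lastStartSec = null; // most recent speechStartAt, in seconds

  return {
    // Feed every Silero.VADEvents.Result payload into this method.
    onVadResult(event) {
      if (event.speechStartAt != null) {
        callerSpeaking = true;
        lastStartSec = event.speechStartAt;
        clearAgentAudio(); // barge-in: drop any queued agent audio
      }
      if (event.speechEndAt != null) {
        callerSpeaking = false;
        // Convert the segment length from seconds to milliseconds.
        const durationMs = lastStartSec == null
          ? null
          : Math.round((event.speechEndAt - lastStartSec) * 1000);
        checkTurnComplete({ durationMs }); // VAD found a pause; let turn detection decide
      }
    },
    isCallerSpeaking() {
      return callerSpeaking;
    },
  };
}
```

In terms of the Example section, `clearAgentAudio` would wrap `ttsPlayer.clearBuffer()` and `checkTurnComplete` would wrap `turnDetector.predict()`. Keeping the speaking state in one place also gives the rest of the scenario a simple way to check whether it is safe for the agent to start talking.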

Example

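require(Modules.Silero);

// `call`, `ttsPlayer`, and `turnDetector` are assumed to be created
// elsewhere in the scenario.
const vad = await Silero.createVAD({
  threshold: 0.5,
  minSilenceDurationMs: 300,
  speechPadMs: 10,
});

// Route the caller's audio into the VAD.
call.sendMediaTo(vad);

vad.addEventListener(Silero.VADEvents.Result, (event) => {
  // Check against null/undefined so a timestamp of 0 is not skipped.
  if (event.speechStartAt != null) {
    // Barge-in: the caller started speaking, so drop queued agent audio.
    ttsPlayer.clearBuffer();
  }

  if (event.speechEndAt != null) {
    // The caller paused: ask the turn detector whether the turn is complete.
    turnDetector.predict();
  }
});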