Voice Activity Detection

Silero VAD in VoxEngine for barge-in and speech-boundary signals

Benefits

Voice Activity Detection (VAD) gives your VoxEngine scenario a fast signal for when a caller starts and stops speaking. In practice, that is what powers barge-in, pause detection, and the first layer of turn-taking.

Capability highlights:

  • Powered by Voximplant’s native Silero module.
  • Works directly on call media inside VoxEngine without a custom media gateway.
  • Emits speech-boundary events you can use to stop TTS playback and trigger downstream turn detection.
  • Pairs naturally with STT and Pipecat Smart Turn in full-cascade voice pipelines.

Architecture

What Voximplant exposes

Load the module and create a VAD instance:

require(Modules.Silero);

const vad = await Silero.createVAD({
  threshold: 0.5,
  minSilenceDurationMs: 300,
  speechPadMs: 10,
});

call.sendMediaTo(vad);

The module surface is intentionally small:

  • Silero.createVAD(parameters)
  • vad.addEventListener(...)
  • vad.reset()
  • vad.close()
  • vad.id() and vad.webSocketId() for diagnostics
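
Because the surface is this small, listener registration and teardown can live in one place. The following is a minimal sketch, not part of the Silero module: `wireVadLifecycle`, `onResult`, and `onFailure` are hypothetical names, and the sketch assumes only that the VAD object exposes `addEventListener()` and `close()` as listed above and that the error payload carries a `reason` field. The event-name map is passed in so the helper can be exercised outside VoxEngine; in a scenario you would pass `Silero.VADEvents`.

```javascript
// Hypothetical helper: registers listeners once and makes teardown idempotent.
// `vad` is any object with addEventListener() and close();
// `events` is the event-name map (Silero.VADEvents inside VoxEngine).
function wireVadLifecycle(vad, events, { onResult, onFailure }) {
  let closed = false;

  const close = () => {
    if (closed) return; // guard against closing the same instance twice
    closed = true;
    vad.close();
  };

  vad.addEventListener(events.Result, onResult);

  // Assumption: on an error, report the reason and release the instance.
  vad.addEventListener(events.Error, (event) => {
    onFailure(event.reason);
    close();
  });

  return { close, isClosed: () => closed };
}
```

The returned `close()` is safe to call from more than one place, for example from both an error path and a call-disconnect handler.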

VAD parameters

These are the parameters Voximplant currently exposes on Silero.createVAD().

| Parameter | Type | Default | What it controls |
| --- | --- | --- | --- |
| threshold | number | 0.5 | Speech probability threshold above which Voximplant treats the segment as speech. Raise it to reduce false positives; lower it to react earlier to softer speech. |
| minSilenceDurationMs | number | 300 | Silence duration required before Voximplant emits speechEndAt for a segment. Raise it if short pauses are splitting turns too aggressively. |
| speechPadMs | number | 0 | Padding added around detected speech segments so the boundaries are less aggressive. Useful when speech starts or ends are getting clipped. |
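
To build intuition for how the three parameters interact, here is a simplified model in plain JavaScript. It is an illustration, not the Silero algorithm or Voximplant's implementation: `segmentSpeech`, the `frameMs` option, and the array of per-frame probabilities are all assumptions made for the sketch.

```javascript
// Illustrative model only: turns per-frame speech probabilities into
// padded speech segments using the three documented parameters.
function segmentSpeech(probs, options = {}) {
  const {
    threshold = 0.5,
    minSilenceDurationMs = 300,
    speechPadMs = 0,
    frameMs = 32, // assumed frame length; not a Voximplant parameter
  } = options;

  const segments = [];
  let startMs = null;      // start of the open segment, or null if none is open
  let lastSpeechEndMs = 0; // end of the most recent frame classified as speech

  const closeSegment = () => {
    segments.push({
      startMs: Math.max(0, startMs - speechPadMs),
      endMs: lastSpeechEndMs + speechPadMs,
    });
    startMs = null;
  };

  probs.forEach((p, i) => {
    const frameEndMs = (i + 1) * frameMs;
    if (p > threshold) {
      // Speech frame: open a segment if needed and extend it.
      if (startMs === null) startMs = i * frameMs;
      lastSpeechEndMs = frameEndMs;
    } else if (startMs !== null && frameEndMs - lastSpeechEndMs >= minSilenceDurationMs) {
      // Enough continuous silence has accumulated: end the segment.
      closeSegment();
    }
  });

  // Close a segment that is still open when the input ends.
  if (startMs !== null) closeSegment();
  return segments;
}
```

In this model, with 100 ms frames, a single 100 ms dip in probability splits one utterance into two segments when `minSilenceDurationMs` is 100 but not when it is 300. This mirrors the tuning advice in the table: raise the value when short pauses split turns too aggressively.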

VAD events

These are the core events exposed by the Silero module.

| Event | Payload | How to use it |
| --- | --- | --- |
| Silero.VADEvents.Result | speechStartAt?, speechEndAt? | Main speech-boundary signal. In Voximplant typings these timestamps are documented in seconds. Use speechStartAt for barge-in and speechEndAt to start pause-based logic. |
| Silero.VADEvents.Reset | none | Emitted after vad.reset(). Useful when you need to clear VAD state between turns or after call state changes. |
| Silero.VADEvents.Error | reason | Connector/runtime error path. Log it and fail fast or fall back cleanly. |
| Silero.VADEvents.ConnectorInformation | data | Connector metadata and diagnostics. Useful for confirming the loaded connector version and endpoint. |
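
Since both timestamps in the Result payload are optional, a handler typically has to check which boundary an event carries. The sketch below normalizes a payload into a list of boundaries with millisecond timestamps. `describeVadResult` is a hypothetical helper, and the conversion assumes the timestamps arrive in seconds, as the typings document.

```javascript
// Hypothetical helper: normalizes a Result payload into boundary records.
// Assumes speechStartAt / speechEndAt are numbers in seconds when present.
function describeVadResult(event) {
  const boundaries = [];

  // typeof checks (rather than truthiness) keep a timestamp of 0 valid.
  if (typeof event.speechStartAt === "number") {
    boundaries.push({ kind: "speech-start", atMs: Math.round(event.speechStartAt * 1000) });
  }
  if (typeof event.speechEndAt === "number") {
    boundaries.push({ kind: "speech-end", atMs: Math.round(event.speechEndAt * 1000) });
  }

  return boundaries;
}
```

A listener registered for `Silero.VADEvents.Result` could pass its event to this helper and branch on `kind`, which keeps unit conversion in one place.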

Development notes

  • Use VAD for speech boundaries, not transcripts. Silero tells you that speech is happening; STT still owns the actual words.
  • Barge-in usually starts with speechStartAt. In a phone assistant flow, that is the right moment to clear any queued agent audio.
  • Pause sensitivity is mostly minSilenceDurationMs. If callers pause naturally mid-sentence, this is usually the first value to tune.
  • speechPadMs is a safety margin. It helps avoid clipped starts and ends, but too much padding can make boundaries feel slower.
  • VAD and turn detection are complementary. A common pattern is: Silero detects speech boundaries, then Pipecat evaluates whether the turn is actually complete.
  • This matters most in cascade architectures. When STT, LLM, and TTS come from different vendors, VAD is usually the first signal that keeps interruptions responsive and prevents the agent from talking over the caller.
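
The barge-in and turn-detection notes above can be sketched as a small coordinator. This is one possible arrangement under stated assumptions, not a Voximplant or Pipecat API: `createTurnCoordinator`, `stopPlayback`, `isTurnComplete`, and `onTurnComplete` are hypothetical names, and the turn check is modeled as a synchronous call to keep the example short (a real turn detector is likely asynchronous).

```javascript
// Hypothetical coordinator: speech start triggers barge-in, speech end
// asks a turn detector whether the caller has actually finished.
function createTurnCoordinator({ stopPlayback, isTurnComplete, onTurnComplete }) {
  let callerSpeaking = false;

  return {
    // Feed this the payload of each VAD Result event.
    handleResult(event) {
      if (typeof event.speechStartAt === "number") {
        callerSpeaking = true;
        stopPlayback(); // barge-in: clear any queued agent audio
      }
      if (typeof event.speechEndAt === "number") {
        callerSpeaking = false;
        // A pause is not necessarily the end of a turn; defer to the detector.
        if (isTurnComplete()) onTurnComplete();
      }
    },
    isCallerSpeaking: () => callerSpeaking,
  };
}
```

Keeping the playback and turn-detection dependencies injected means the same coordinator can be tested with stubs and later pointed at whichever TTS player and turn detector the scenario uses.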

Example

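require(Modules.Silero);

// `call`, `ttsPlayer`, and `turnDetector` are assumed to be created
// elsewhere in your scenario; they are placeholders in this example.
const vad = await Silero.createVAD({
  threshold: 0.5,
  minSilenceDurationMs: 300,
  speechPadMs: 10,
});

// Route the caller's audio into the VAD.
call.sendMediaTo(vad);

vad.addEventListener(Silero.VADEvents.Result, (event) => {
  // Caller started speaking: barge in by clearing queued agent audio.
  if (event.speechStartAt) {
    ttsPlayer.clearBuffer();
  }

  // Caller paused: ask the turn detector whether the turn is complete.
  if (event.speechEndAt) {
    turnDetector.predict();
  }
});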
