Voice Activity Detection
Silero VAD in VoxEngine for barge-in and speech-boundary signals
Benefits
Voice Activity Detection (VAD) gives your VoxEngine scenario a fast signal for when a caller starts and stops speaking. In practice, that is what powers barge-in, pause detection, and the first layer of turn-taking.
Capability highlights:
- Powered by Voximplant’s native Silero module.
- Works directly on call media inside VoxEngine without a custom media gateway.
- Emits speech-boundary events you can use to stop TTS playback and trigger downstream turn detection.
- Pairs naturally with STT and Pipecat Smart Turn in full-cascade voice pipelines.
Architecture
What Voximplant exposes
Load the module and create a VAD instance:
The module surface is intentionally small:
- Silero.createVAD(parameters)
- vad.addEventListener(...)
- vad.reset()
- vad.close()
- vad.id() and vad.webSocketId() for diagnostics
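As a rough sketch of how that surface fits together, the helper below wires one listener onto a VAD-like object and exposes the lifecycle and diagnostic calls. Only the method names come from the list above. The helper attachVad, the stand-in createFakeVad, the event name "placeholderEvent", and the assumption that addEventListener takes an event name and a callback are all hypothetical, used so the example runs outside VoxEngine.

```javascript
// Hypothetical helper: attaches a listener and returns a small handle.
// Only the vad.* method names are taken from the module surface.
function attachVad(vad, eventName, onEvent) {
  vad.addEventListener(eventName, onEvent);
  return {
    // Identifiers that are handy to include in log lines.
    describe: () => `vad=${vad.id()} ws=${vad.webSocketId()}`,
    // Forwards to vad.reset().
    reset: () => vad.reset(),
    // Forwards to vad.close(), for example when the call ends.
    close: () => vad.close(),
  };
}

// Stand-in for the object returned by Silero.createVAD().
// This is NOT the real implementation, just enough to run the sketch.
function createFakeVad() {
  const listeners = new Map();
  const calls = [];
  return {
    calls,
    addEventListener(name, fn) { listeners.set(name, fn); },
    reset() { calls.push("reset"); },
    close() { calls.push("close"); },
    id() { return "fake-vad-1"; },
    webSocketId() { return "fake-ws-1"; },
    // Test-only hook that simulates the module firing an event.
    emit(name, payload) {
      const fn = listeners.get(name);
      if (fn) fn(payload);
    },
  };
}

const fakeVad = createFakeVad();
const seen = [];
const handle = attachVad(fakeVad, "placeholderEvent", (e) => seen.push(e));

fakeVad.emit("placeholderEvent", { note: "speech boundary" });
handle.reset();
handle.close();
console.log(handle.describe(), seen.length, fakeVad.calls.join(","));
```

In a real scenario the fake would be replaced by the instance returned from Silero.createVAD(), and the placeholder event name by one of the module's actual events.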
VAD parameters
These are the parameters Voximplant currently exposes on Silero.createVAD().
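To build intuition for tuning, the sketch below applies a minimum-silence rule and symmetric padding to a list of per-frame speech flags. It is a generic post-processing model, not Silero's or Voximplant's actual algorithm. The option names mirror minSilenceDurationMs and speechPadMs as discussed in the development notes on this page, while frameMs, the sample data, and the function segmentSpeech are hypothetical.

```javascript
// Generic illustration only: turns per-frame speech flags into segments.
// A segment closes once silence has lasted at least minSilenceDurationMs,
// and each boundary is widened by speechPadMs.
function segmentSpeech(flags, { frameMs, minSilenceDurationMs, speechPadMs }) {
  const segments = [];
  let startMs = null;      // start of the currently open segment, if any
  let lastSpeechEndMs = 0; // end time of the most recent speech frame

  const closeSegment = () => {
    segments.push({
      startMs: Math.max(0, startMs - speechPadMs),
      endMs: lastSpeechEndMs + speechPadMs,
    });
    startMs = null;
  };

  flags.forEach((isSpeech, i) => {
    const frameStartMs = i * frameMs;
    const frameEndMs = frameStartMs + frameMs;
    if (isSpeech) {
      if (startMs === null) startMs = frameStartMs;
      lastSpeechEndMs = frameEndMs;
    } else if (startMs !== null &&
               frameEndMs - lastSpeechEndMs >= minSilenceDurationMs) {
      closeSegment();
    }
  });

  // Audio ended while speech was still open.
  if (startMs !== null) closeSegment();
  return segments;
}

// Speech at 100-300 ms and 500-700 ms, separated by a 200 ms pause.
const frames = [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0].map(Boolean);

// A longer silence threshold bridges the pause into one segment.
const patient = segmentSpeech(frames,
  { frameMs: 100, minSilenceDurationMs: 300, speechPadMs: 0 });

// A shorter threshold splits at the pause; padding widens each boundary.
const eager = segmentSpeech(frames,
  { frameMs: 100, minSilenceDurationMs: 200, speechPadMs: 50 });

console.log(patient.length, eager.length);
```

The trade-off it shows is the one to expect when tuning: a longer silence threshold tolerates mid-sentence pauses but reports the end of speech later, and padding protects word edges at the cost of wider boundaries.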
VAD events
These are the core events exposed by the Silero module.
Development notes
- Use VAD for speech boundaries, not transcripts. Silero tells you that speech is happening; STT still owns the actual words.
- Barge-in usually starts with speechStartAt. In a phone assistant flow, that is the right moment to clear any queued agent audio.
- Pause sensitivity is mostly minSilenceDurationMs. If callers pause naturally mid-sentence, this is usually the first value to tune.
- speechPadMs is a safety margin. It helps avoid clipped starts and ends, but too much padding can make boundaries feel slower.
- VAD and turn detection are complementary. A common pattern is: Silero detects speech boundaries, then Pipecat evaluates whether the turn is actually complete.
- This matters most in cascade architectures. When STT, LLM, and TTS come from different vendors, VAD is usually the first signal that keeps interruptions responsive and prevents the agent from talking over the caller.
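The barge-in and turn-detection notes above can be sketched as a small controller. Everything in it is hypothetical glue rather than Voximplant or Pipecat API: clearAgentAudio and onTurnComplete stand in for whatever your scenario uses to flush queued TTS and hand the turn to the LLM, and onTurnVerdict is where the answer from a turn detector such as Pipecat Smart Turn would be fed in. The onSpeechStart and onSpeechEnd methods would be called from your VAD listener when it reports a speech start or end.

```javascript
// Hypothetical controller combining barge-in with a turn-detection handoff.
function createTurnController({ clearAgentAudio, onTurnComplete }) {
  let callerSpeaking = false;
  return {
    // Call when the VAD reports that the caller started speaking.
    onSpeechStart() {
      callerSpeaking = true;
      clearAgentAudio(); // barge-in: stop the agent talking over the caller
    },
    // Call when the VAD reports that the caller stopped speaking.
    onSpeechEnd() {
      callerSpeaking = false;
    },
    // Call with the turn detector's verdict once it arrives.
    onTurnVerdict(complete) {
      // Ignore the verdict if the caller resumed speaking in the meantime.
      if (complete && !callerSpeaking) onTurnComplete();
    },
  };
}

const actions = [];
const controller = createTurnController({
  clearAgentAudio: () => actions.push("clear-audio"),
  onTurnComplete: () => actions.push("turn-complete"),
});

controller.onSpeechStart();      // caller interrupts the agent
controller.onSpeechEnd();        // caller pauses
controller.onTurnVerdict(false); // detector: caller is mid-thought, keep waiting
controller.onSpeechStart();      // caller resumes
controller.onSpeechEnd();
controller.onTurnVerdict(true);  // detector: the turn is complete
console.log(actions.join(" > "));
```

Keeping the verdict as a separate call means the turn detector can answer asynchronously, and a late "complete" verdict is dropped if the caller has already started speaking again.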