---
title: Voice Activity Detection
subtitle: Silero VAD in VoxEngine for barge-in and speech-boundary signals
---

## Benefits

Voice Activity Detection (VAD) gives your VoxEngine scenario a fast signal for when a caller starts and stops speaking.
In practice, that is what powers barge-in, pause detection, and the first layer of turn-taking.

Capability highlights:

* Powered by Voximplant's native [Silero module](https://voximplant.com/docs/references/voxengine/silero).
* Works directly on call media inside VoxEngine without a custom media gateway.
* Emits speech-boundary events you can use to stop TTS playback and trigger downstream turn detection.
* Pairs naturally with STT and Pipecat Smart Turn in full-cascade voice pipelines.

## Architecture

```mermaid
graph LR
  Caller["Caller audio"] --> Call["VoxEngine call media"]
  Call --> VAD["Silero VAD"]
  VAD --> Events["speechStartAt / speechEndAt"]
  Events --> Logic["Barge-in, timers, turn policy"]
```

## What Voximplant exposes

Load the module and create a VAD instance:

```js
require(Modules.Silero);

const vad = await Silero.createVAD({
  threshold: 0.5,
  minSilenceDurationMs: 300,
  speechPadMs: 10,
});

call.sendMediaTo(vad);
```

The module surface is intentionally small:

* `Silero.createVAD(parameters)`
* `vad.addEventListener(...)`
* `vad.reset()`
* `vad.close()`
* `vad.id()` and `vad.webSocketId()` for diagnostics

## VAD parameters

These are the parameters Voximplant currently exposes on `Silero.createVAD()`.

| Parameter              | Type     | Default | What it controls                                                                                                                                                  |
| ---------------------- | -------- | ------: | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `threshold`            | `number` |   `0.5` | Speech probability threshold above which Voximplant treats the segment as speech. Raise it to reduce false positives; lower it to react earlier to softer speech. |
| `minSilenceDurationMs` | `number` |   `300` | Silence duration required before Voximplant emits `speechEndAt` for a segment. Raise it if short pauses are splitting turns too aggressively.                     |
| `speechPadMs`          | `number` |     `0` | Padding added around detected speech segments so the boundaries are less aggressive. Useful when speech starts or ends are getting clipped.                       |
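
To build intuition for how `threshold` and `minSilenceDurationMs` interact, here is a minimal sketch of a segmenter that runs over per-frame speech probabilities. It is a simplified model for illustration only, not the algorithm Silero or Voximplant actually runs. The `segmentSpeech` function, the `frameMs` option, and the sample probabilities are all hypothetical.

```js
// Simplified illustration only: NOT Silero's or Voximplant's actual algorithm.
// probs: one speech probability per audio frame.
// frameMs: hypothetical frame duration used to convert frame indexes to time.
function segmentSpeech(
  probs,
  { threshold = 0.5, minSilenceDurationMs = 300, frameMs = 100 } = {}
) {
  const segments = [];
  let startMs = null; // start of the open segment, or null when idle
  let silenceMs = 0; // consecutive silence inside the open segment
  let lastSpeechEndMs = 0; // end time of the most recent speech frame

  probs.forEach((p, i) => {
    const frameStartMs = i * frameMs;
    if (p > threshold) {
      // Speech frame: open a segment if needed and reset the silence counter.
      if (startMs === null) startMs = frameStartMs;
      silenceMs = 0;
      lastSpeechEndMs = frameStartMs + frameMs;
    } else if (startMs !== null) {
      // Silence inside an open segment: close it once enough has accumulated.
      silenceMs += frameMs;
      if (silenceMs >= minSilenceDurationMs) {
        segments.push({ startMs, endMs: lastSpeechEndMs });
        startMs = null;
        silenceMs = 0;
      }
    }
  });

  // Flush a segment that is still open when the input ends.
  if (startMs !== null) segments.push({ startMs, endMs: lastSpeechEndMs });
  return segments;
}

// Two speech bursts separated by a 200 ms pause.
const probs = [0.9, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.1, 0.1];
const short = segmentSpeech(probs, { minSilenceDurationMs: 200 });
const long = segmentSpeech(probs, { minSilenceDurationMs: 300 });
console.log(short.length, long.length);
```

With the 200 ms setting the pause splits the audio into two segments; with 300 ms the same pause is absorbed and the caller's speech stays in one segment. That is the trade-off behind raising `minSilenceDurationMs` when short pauses split turns too aggressively.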

## VAD events

These are the core events exposed by the Silero module.

| Event                                   | Payload                          | How to use it                                                                                                                                                                 |
| --------------------------------------- | -------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `Silero.VADEvents.Result`               | `speechStartAt?`, `speechEndAt?` | Main speech-boundary signal. In Voximplant typings these timestamps are documented in seconds. Use `speechStartAt` for barge-in and `speechEndAt` to start pause-based logic. |
| `Silero.VADEvents.Reset`                | none                             | Emitted after `vad.reset()`. Useful when you need to clear VAD state between turns or after call state changes.                                                               |
| `Silero.VADEvents.Error`                | `reason`                         | Connector/runtime error path. Log it and fail fast or fall back cleanly.                                                                                                      |
| `Silero.VADEvents.ConnectorInformation` | `data`                           | Connector metadata and diagnostics. Useful for confirming the loaded connector version and endpoint.                                                                          |
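
Because both `Result` fields are optional and are documented in seconds, it can help to normalize the payload before acting on it. The sketch below shows one possible way to do that. `describeVadResult` is a hypothetical helper, and it assumes the timestamps sit directly on the event object as shown in the table.

```js
// Hypothetical helper: normalizes a VAD Result payload.
// Assumes speechStartAt / speechEndAt are optional numbers in seconds,
// located directly on the event object.
function describeVadResult(event) {
  // Convert seconds to whole milliseconds; a missing field becomes null.
  const toMs = (seconds) =>
    typeof seconds === "number" ? Math.round(seconds * 1000) : null;

  const startMs = toMs(event.speechStartAt);
  const endMs = toMs(event.speechEndAt);

  return {
    startMs,
    endMs,
    // Explicit null checks so that a timestamp of 0 still counts as present.
    speechStarted: startMs !== null,
    speechEnded: endMs !== null,
    durationMs: startMs !== null && endMs !== null ? endMs - startMs : null,
  };
}

const startOnly = describeVadResult({ speechStartAt: 0 });
const both = describeVadResult({ speechStartAt: 1.25, speechEndAt: 2.5 });
console.log(startOnly, both);
```

The explicit `null` checks matter: a plain truthiness test would treat a `speechStartAt` of `0` as if no speech had started.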

## Development notes

* **Use VAD for speech boundaries, not transcripts**. Silero tells you that speech is happening; STT still owns the actual words.
* **Barge-in usually starts with `speechStartAt`**. In a phone assistant flow, that is the right moment to clear any queued agent audio.
* **Pause sensitivity is mostly `minSilenceDurationMs`**. If callers pause naturally mid-sentence, this is usually the first value to tune.
* **`speechPadMs` is a safety margin**. It helps avoid clipped starts and ends, but too much padding can make boundaries feel slower.
* **VAD and turn detection are complementary**. A common pattern is: Silero detects speech boundaries, then Pipecat Smart Turn evaluates whether the turn is actually complete.
* **This matters most in cascade architectures**. When STT, LLM, and TTS come from different vendors, VAD is usually the first signal that keeps interruptions responsive and prevents the agent from talking over the caller.
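
The barge-in and turn-detection notes above can be sketched as a small policy object that sits between VAD results and the rest of the pipeline. This is one possible arrangement under stated assumptions, not a Voximplant API: `createTurnPolicy`, `stopPlayback`, and `checkTurn` are hypothetical names, and the callbacks stand in for your own TTS and turn-detection code.

```js
// Hypothetical policy object: routes VAD results to barge-in and turn checks.
// stopPlayback and checkTurn are callbacks supplied by your own pipeline.
function createTurnPolicy({ stopPlayback, checkTurn }) {
  let agentSpeaking = false;
  let callerSpeaking = false;

  return {
    // Call these from your TTS code when agent audio starts or finishes.
    agentStarted() {
      agentSpeaking = true;
    },
    agentFinished() {
      agentSpeaking = false;
    },

    // Feed each VAD Result payload into this method.
    onVadResult(event) {
      if (event.speechStartAt != null) {
        callerSpeaking = true;
        // Barge-in: only interrupt when the agent is actually talking.
        if (agentSpeaking) {
          stopPlayback();
          agentSpeaking = false;
        }
      }
      if (event.speechEndAt != null) {
        callerSpeaking = false;
        // The caller paused: let turn detection decide if the turn is done.
        checkTurn();
      }
    },

    isCallerSpeaking: () => callerSpeaking,
  };
}

// Usage with stub callbacks that record what the policy decided.
const log = [];
const policy = createTurnPolicy({
  stopPlayback: () => log.push("stopPlayback"),
  checkTurn: () => log.push("checkTurn"),
});

policy.agentStarted();
policy.onVadResult({ speechStartAt: 3.2 }); // caller interrupts the agent
policy.onVadResult({ speechEndAt: 4.1 }); // caller pauses
console.log(log);
```

Tracking `agentSpeaking` keeps the policy from issuing redundant stop calls when the caller speaks while the agent is already silent.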

## Example

```js
require(Modules.Silero);

const vad = await Silero.createVAD({
  threshold: 0.5,
  minSilenceDurationMs: 300,
  speechPadMs: 10,
});

call.sendMediaTo(vad);

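// `ttsPlayer` and `turnDetector` are placeholders for your own TTS player
// and turn-detection objects; they are not provided by the Silero module.
vad.addEventListener(Silero.VADEvents.Result, (event) => {
  // Check for presence rather than truthiness so a timestamp of 0 is not skipped.
  if (event.speechStartAt != null) {
    // Caller started speaking: drop queued agent audio (barge-in).
    ttsPlayer.clearBuffer();
  }

  if (event.speechEndAt != null) {
    // Caller paused: ask the turn detector whether the turn is complete.
    turnDetector.predict();
  }
});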
```

## Links

### Voximplant

* [VAD and Turn Detection product page](https://voximplant.com/products/turn-detection)
* [VAD and Turn Detection guide](https://voximplant.com/docs/guides/speech/vad-turn-detection)
* [Silero module API reference](https://voximplant.com/docs/references/voxengine/silero)

### Upstream technology

* [Silero](https://github.com/snakers4/silero-vad)