Turn Taking Helper Library
Reusable VoxEngine helper for STT + VAD + Pipecat Smart Turn pipelines
Overview
VoxTurnTaking is a reusable VoxEngine helper scenario that packages the current turn-taking policy used in full-cascade examples.
It exists to keep your main scenario focused on STT, LLM, and TTS wiring while the helper owns the mechanics of barge-in, transcript accumulation, end-of-turn prediction, and fallback timing.
The design goal is pragmatic:
- simplify turn-taking implementations in VoxEngine scenarios
- add key timers and policy controls that are common in production turn-taking flows
- stay close to Pipecat Smart Turn concepts
How it works
A user turn stays open until the helper decides to submit it. Silero VAD, Pipecat turn detection, STT results, and local timeouts are only signals that help make that decision.
Turn lifecycle
The helper keeps one open user turn in memory. As transcripts arrive, it accumulates them until there is enough evidence to submit the turn to your LLM.
The main signals are:
- Silero VAD: detects when speech starts and stops
- Pipecat turn detection: predicts whether the caller is probably finished
- STT: provides interim and final transcript text
- local timers: handle grace periods and fallback behavior when STT or turn detection is late or uncertain
The turn does not end when VAD fires or when Pipecat returns endOfTurn.
Those events are only evidence that the user may be finished.
This distinction matters because both signals can be early or incomplete:
- VAD only knows that speech stopped, not whether the thought is complete
- Pipecat predicts end-of-turn, but STT may still be landing the final words
- callers often pause, restart, self-correct, or begin with short disfluencies such as “um” or “and” that indicate they are formulating or continuing a thought rather than finishing
If the helper treated those signals as hard turn boundaries, it would submit too many clipped or fragmented turns.
Instead, VoxTurnTaking keeps the turn open until it has enough evidence to actually commit the transcript and call onUserTurn(input, reason).
End-of-turn parameters often vary by language, use case, and LLM behavior.
The helper lets you configure them through the policy option, which is layered on top of Voximplant’s current Silero and Pipecat surface.
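The idea that VAD and Pipecat events are evidence rather than hard boundaries can be sketched as a small decision function. This is an illustrative sketch only, not the helper's actual logic: the shape of the turn object, the shouldSubmitTurn name, and the reason strings are all hypothetical.

```javascript
// Hypothetical sketch: decide whether an open user turn is ready to submit.
// Field names and reason strings are illustrative, not the helper's real internals.
function shouldSubmitTurn(turn, nowMs) {
  // No transcript yet: nothing to submit, whatever VAD or Pipecat say.
  if (turn.transcript.trim() === "") {
    return { submit: false, reason: "no_transcript" };
  }

  // The caller is still (or again) speaking: keep the turn open.
  if (turn.speaking) {
    return { submit: false, reason: "user_speaking" };
  }

  const silenceMs = nowMs - turn.speechEndAtMs;

  // End-of-turn was predicted and STT has had a settle window to land final words.
  if (turn.endOfTurnPredicted && silenceMs >= turn.settleMs) {
    return { submit: true, reason: "end_of_turn" };
  }

  // Fallback: enough silence has passed even without an end-of-turn prediction.
  if (silenceMs >= turn.fallbackMs) {
    return { submit: true, reason: "timeout" };
  }

  return { submit: false, reason: "waiting" };
}
```

Note that neither the end of speech nor the end-of-turn prediction submits the turn on its own; each only changes which timing condition applies.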
Barge-in handling
In a full-cascade pipeline, barge-in is not just “detect speech and stop audio.” There are two separate responsibilities:
- The helper detects the interruption. VoxTurnTaking listens for speechStartAt from Silero VAD. When new user speech starts, it marks agent audio as no longer safe to forward and calls your onInterrupt() callback.
- Your scenario stops playback. Inside onInterrupt(), the consuming scenario should clear any queued or currently buffered TTS playback for the active turn.
The helper owns turn state and interruption policy. You should choose an interruption approach based on your TTS provider and playback policy.
In practice, the barge-in path often looks like this:
- The agent is speaking and response text is being streamed to TTS
- The caller starts talking
- Silero emits speechStartAt (handled in the helper)
- VoxTurnTaking marks agent audio as interrupted and calls onInterrupt() (handled in the helper)
- Your scenario clears buffered TTS audio
- Additional stale text deltas from the interrupted response are ignored until the next user turn is submitted.
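The sequence above can be simulated outside VoxEngine with a small stand-in. This is a sketch under stated assumptions: createBargeInDemo, its method names, and the array used as a TTS buffer are hypothetical, and a real scenario would call its TTS provider's own clear or stop mechanism inside onInterrupt().

```javascript
// Hypothetical simulation of the barge-in path; none of these names are VoxEngine APIs.
function createBargeInDemo() {
  const ttsBuffer = []; // stand-in for text/audio queued for playback
  let agentAudioAllowed = true;

  // Scenario side: clear whatever is queued for playback.
  function onInterrupt() {
    ttsBuffer.length = 0;
  }

  return {
    ttsBuffer,
    // The LLM emitted a text delta for the current response.
    onResponseTextDelta(text) {
      if (!agentAudioAllowed) return; // stale delta from an interrupted response
      ttsBuffer.push(text);
    },
    // Helper side: VAD reported that the caller started speaking.
    onSpeechStart() {
      agentAudioAllowed = false;
      onInterrupt();
    },
    // Helper side: the next user turn was submitted to the LLM.
    onUserTurnSubmitted() {
      agentAudioAllowed = true;
    },
  };
}
```

Keeping the buffer-clearing logic in onInterrupt() mirrors the split described above: the helper decides that an interruption happened, and the scenario decides how playback stops.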
Agent audio gating
turnTaking.canPlayAgentAudio() is the guard on the downstream text-to-speech path.
It answers a simple question:
should this response text still be spoken to the caller?
The LLM and TTS pipelines are asynchronous: even after the caller interrupts, the LLM may continue emitting ResponseTextDelta events for the now-stale assistant response.
With this gate:
- once VoxTurnTaking marks the agent response as interrupted, canPlayAgentAudio() returns false
- the scenario drops those stale deltas instead of forwarding them to TTS
- agent audio only becomes allowed again after the helper submits the next user turn
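A minimal sketch of the gating pattern follows. Only canPlayAgentAudio() comes from the helper's documented surface; createGateStub, its simulate* methods, forwardDelta, and the sendToTts callback are hypothetical stand-ins so the pattern can run outside VoxEngine.

```javascript
// Minimal stand-in exposing canPlayAgentAudio() so the gating pattern can run
// outside VoxEngine. The real helper manages this state internally.
function createGateStub() {
  let allowed = true;
  return {
    canPlayAgentAudio: () => allowed,
    simulateInterrupt: () => { allowed = false; },        // hypothetical test hook
    simulateUserTurnSubmitted: () => { allowed = true; }, // hypothetical test hook
  };
}

// Forward a response text delta to TTS only while the gate is open.
// `sendToTts` is a hypothetical callback wrapping whatever TTS client the scenario uses.
function forwardDelta(turnTaking, sendToTts, delta) {
  if (!turnTaking.canPlayAgentAudio()) {
    return false; // stale delta from an interrupted response: drop it
  }
  sendToTts(delta);
  return true;
}
```

In a scenario, the same check would sit in the handler that receives response text deltas, immediately before the text is handed to TTS.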
Policy options
The policy object is where the helper adds practical turn-taking behavior on top of Voximplant’s current Silero and Pipecat surface.
The options fall into a few groups:
- settle timing - transcriptSettleMs gives STT a short extra window after Pipecat thinks the user may be done
- default fallback timing - userSpeechTimeoutMs is the standard “wait a bit more, then submit” timeout after speechEndAt
- short-utterance handling - shortUtteranceExtensionMs, fastShortUtteranceTimeoutMs, shortUtteranceMaxChars, and shortUtteranceMaxWords let the helper treat short fragments differently from longer turns
- replaceable short finals - lowConfidenceShortUtteranceThreshold keeps weak short finals open so they can be replaced by later STT
- continuation cues - continuationTokens helps hold turns open when the user starts with fragments such as “and”, “so”, or “um”
Callers often pause, restart, hedge, and self-correct. These policy controls prevent short fragments from becoming clipped, standalone turns.
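To make the option groups concrete, here is a sketch of a policy object and two classifiers that such options could drive. The option names are the ones listed above; every value is a made-up placeholder rather than a documented default, and isShortUtterance and startsWithContinuationToken are hypothetical functions whose exact rules (for example, requiring both the character and word limits) are assumptions.

```javascript
// Option names come from the list above; all values are illustrative placeholders.
const examplePolicy = {
  transcriptSettleMs: 250,
  userSpeechTimeoutMs: 1200,
  shortUtteranceExtensionMs: 600,
  fastShortUtteranceTimeoutMs: 400,
  shortUtteranceMaxChars: 12,
  shortUtteranceMaxWords: 2,
  lowConfidenceShortUtteranceThreshold: 0.6,
  continuationTokens: ["and", "so", "um"],
};

function countWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Assumption: an utterance is "short" only if it is within BOTH limits.
function isShortUtterance(text, policy) {
  return (
    text.trim().length <= policy.shortUtteranceMaxChars &&
    countWords(text) <= policy.shortUtteranceMaxWords
  );
}

// Assumption: a continuation cue is matched on the first word, ignoring case
// and punctuation. The regex only handles ASCII letters, which is enough here.
function startsWithContinuationToken(text, policy) {
  const first = text.trim().toLowerCase().split(/\s+/)[0].replace(/[^a-z']/g, "");
  return policy.continuationTokens.includes(first);
}
```

A helper could use classifications like these to choose between a longer hold and a faster fallback, which is why the limits and tokens are exposed as policy rather than hard-coded.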
Cleanup and shutdown
turnTaking.close() is the helper’s call-scoped shutdown path.
It:
- clears active fallback and settle timers
- closes the Silero VAD instance
- closes the Pipecat turn detector instance
VAD and turn detector connectors are live per-call resources and should be closed to avoid accidental charges. In addition, the helper keeps call-local timers and turn-state flags in memory. Cleanup avoids late timer callbacks or stale events during termination.
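The shutdown behavior described above can be sketched as an idempotent close routine. This is not the helper's source: createCallResources and its methods are hypothetical, and the vad and turnDetector arguments are assumed to be any objects with a close() method.

```javascript
// Hypothetical sketch of call-scoped cleanup: track timers, close resources once.
function createCallResources(vad, turnDetector) {
  const timers = new Set();
  let closed = false;

  return {
    // Schedule a callback that is automatically suppressed after close().
    schedule(fn, ms) {
      if (closed) return null;
      const id = setTimeout(() => {
        timers.delete(id);
        if (!closed) fn();
      }, ms);
      timers.add(id);
      return id;
    },
    pendingTimers: () => timers.size,
    // Safe to call more than once, e.g. from several disconnect handlers.
    close() {
      if (closed) return;
      closed = true;
      for (const id of timers) clearTimeout(id);
      timers.clear();
      vad.close();
      turnDetector.close();
    },
  };
}
```

Guarding close() with a flag means the same cleanup path can be called from more than one termination handler without closing a resource twice.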
How to use it
1. Include the helper before your scenario
VoxTurnTaking is implemented as a sequenced VoxEngine scenario, not a JavaScript module.
That matters because VoxEngine routing-rule sequencing shares global scope.
Include vox-turn-taking before your main scenario in the same routing rule sequence.
For the routing-rule setup flow in the Control Panel, see Setup routing.
2. Create STT, LLM, and TTS in the consuming scenario
The helper owns Silero and Pipecat. Your main scenario still owns:
- STT
- LLM / Responses client
- TTS
- agent prompts
- what happens when barge-in interrupts playback
This separation is useful when you are mixing providers in a full-cascade pipeline, for example Deepgram + Groq via the OpenAI Responses client + Inworld.
3. Create the helper runtime
4. Gate agent audio through the helper
5. Clean up on disconnect
Public API
Required options
Optional options
Helper parameters
VAD options
These are forwarded to Silero.createVAD().
Turn detector options
These are forwarded to Pipecat.createTurnDetector().
Helper policy options
These are local helper controls, not native Silero or Pipecat parameters.
Return value
VoxTurnTaking.create() returns an object with:
Turn Taking Helper Code
Related example
The current full-cascade example that consumes this helper is: