Turn Taking Helper Library | Voximplant Voice AI

Overview

VoxTurnTaking is a reusable VoxEngine helper scenario that packages the current turn-taking policy used in full-cascade examples. It exists to keep your main scenario focused on STT, LLM, and TTS wiring while the helper owns the mechanics of barge-in, transcript accumulation, end-of-turn prediction, and fallback timing.

The design goal is pragmatic:

simplify turn-taking implementations in VoxEngine scenarios
add key timers and policy controls that are common in production turn-taking flows
stay close to Pipecat Smart Turn concepts

[Figure: Turn-taking helper flow illustration]

How it works

A user turn stays open until the helper decides to submit it. Silero VAD, Pipecat turn detection, STT results, and local timeouts are only signals that help make that decision.

Turn lifecycle

The helper keeps one open user turn in memory. As transcripts arrive, it accumulates them until there is enough evidence to submit the turn to your LLM.

The main signals are:

Silero VAD: detects when speech starts and stops
Pipecat turn detection: predicts whether the caller is probably finished
STT: provides interim and final transcript text
local timers: handle grace periods and fallback behavior when STT or turn detection are late or uncertain

The turn does not end when VAD fires or when Pipecat returns endOfTurn. Those events are only evidence that the user may be finished.

This distinction matters because both signals can be early or incomplete:

VAD only knows that speech stopped, not whether the thought is complete
Pipecat predicts end-of-turn, but STT may still be landing the final words
callers often pause, restart, self-correct, or begin with short disfluencies such as “um” or “and”, which indicate they are formulating or continuing a thought rather than finishing

If the helper treated those signals as hard turn boundaries, it would submit too many clipped or fragmented turns. Instead, VoxTurnTaking keeps the turn open until it has enough evidence to actually commit the transcript and call onUserTurn(input, reason).
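Because every submitted turn carries a reason, the consuming scenario can observe which signal actually closed each turn. The sketch below is one possible way to do that. The reason strings are the ones used in the helper source on this page; `createTurnStats` and its methods are hypothetical names invented for this illustration and are not part of VoxTurnTaking.

```javascript
// Reason strings as they appear in the helper source on this page.
const TURN_REASONS = [
  "TURN_DETECT: END_OF_TURN",
  "TURN_DETECT: FINAL_TRANSCRIPT",
  "TURN_DETECT: ASR_GRACE",
  "FALLBACK_END_OF_TURN",
];

// Hypothetical per-call counter (an assumption for illustration only).
function createTurnStats() {
  const counts = Object.fromEntries(TURN_REASONS.map((reason) => [reason, 0]));
  return {
    record(reason) {
      counts[reason] = (counts[reason] || 0) + 1;
    },
    // Share of turns that were closed by the fallback timer.
    fallbackRatio() {
      const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
      return total === 0 ? 0 : counts["FALLBACK_END_OF_TURN"] / total;
    },
    snapshot() {
      return { ...counts };
    },
  };
}

const stats = createTurnStats();

// Same shape as the onUserTurn(input, reason) callback.
const onUserTurn = (input, reason) => {
  stats.record(reason);
  // ...forward `input` to the LLM here...
};

// Simulated turns, standing in for real helper callbacks.
onUserTurn("hello", "TURN_DETECT: END_OF_TURN");
onUserTurn("do they support OpenAI", "FALLBACK_END_OF_TURN");
```

A consistently high fallback ratio may suggest that the turn detector threshold or the timeout values deserve another look for your language and use case.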

End-of-turn parameters often vary by language, use case, and LLM behavior. The helper lets you configure them through the policy option, which is layered on top of the Silero and Pipecat options that Voximplant currently exposes.

Barge-in handling

In a full-cascade pipeline, barge-in is not just “detect speech and stop audio.” There are two separate responsibilities:

The helper detects the interruption. VoxTurnTaking listens for speechStartAt from Silero VAD. When new user speech starts, it marks agent audio as no longer safe to forward and calls your onInterrupt() callback.
Your scenario stops playback. Inside onInterrupt(), the consuming scenario should clear any queued or currently buffered TTS playback for the active turn.

The helper owns turn state and decides when an interruption has occurred. Your scenario decides how playback stops, so choose an interruption approach that fits your TTS provider and playback policy.

In practice, the barge-in path often looks like this:

The agent is speaking and response text is being streamed to TTS
The caller starts talking
Silero emits speechStartAt (handled in the helper)
VoxTurnTaking marks agent audio as interrupted and calls onInterrupt() (handled in the helper)
Your scenario clears buffered TTS audio
Additional stale text deltas from the interrupted response are ignored until the next user turn is submitted.

Agent audio gating

turnTaking.canPlayAgentAudio() is the guard on the downstream text-to-speech path. It answers a simple question:

should this response text still be spoken to the caller?

The LLM and TTS pipeline are asynchronous. Even after the caller interrupts, the LLM may continue emitting ResponseTextDelta events for the now-stale assistant response.

With this gate:

once VoxTurnTaking marks the agent response as interrupted, canPlayAgentAudio() returns false
the scenario drops those stale deltas instead of forwarding them to TTS
agent audio only becomes allowed again after the helper submits the next user turn
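The gate's behavior can be illustrated without any VoxEngine modules. The following is a minimal simulation, assuming a stand-in object (`createFakeGate`, a hypothetical name) that flips the same kind of boolean the helper keeps internally. It is not the helper itself, only a model of the three rules above.

```javascript
// Stand-in for the helper's gate. In the real helper the flag is private
// and is changed by speechStartAt and by turn submission.
function createFakeGate() {
  let allowAgentAudio = true;
  return {
    canPlayAgentAudio: () => allowAgentAudio,
    interrupt() {
      allowAgentAudio = false; // models a barge-in
    },
    submitUserTurn() {
      allowAgentAudio = true; // models the next user turn being submitted
    },
  };
}

const gate = createFakeGate();
const spoken = []; // stands in for text forwarded to TTS

const onTextDelta = (text) => {
  if (!text || !gate.canPlayAgentAudio()) return; // drop stale deltas
  spoken.push(text);
};

onTextDelta("Sure, ");                // forwarded
gate.interrupt();                     // caller starts talking
onTextDelta("the opening hours are"); // dropped: response is stale
gate.submitUserTurn();                // next user turn goes to the LLM
onTextDelta("Of course.");            // forwarded again
```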

Policy options

The policy object is where the helper adds practical turn-taking behavior on top of the Silero and Pipecat options that Voximplant currently exposes.

The options fall into a few groups:

settle timing - transcriptSettleMs gives STT a short extra window after Pipecat thinks the user may be done
default fallback timing - userSpeechTimeoutMs is the standard “wait a bit more, then submit” timeout after speechEndAt
short-utterance handling - shortUtteranceExtensionMs, fastShortUtteranceTimeoutMs, shortUtteranceMaxChars, and shortUtteranceMaxWords let the helper treat short fragments differently from longer turns
replaceable short finals - lowConfidenceShortUtteranceThreshold keeps weak short finals open so they can be replaced by later STT
continuation cues - continuationTokens helps hold turns open when the user starts with fragments such as “and”, “so”, or “um”

Callers often pause, restart, hedge, and self-correct. These policy controls prevent short fragments from becoming clipped, standalone turns.
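To make the short-utterance rules concrete, here is a standalone restatement of the checks the helper source applies to a final STT chunk, using the default policy values. The `classifyFinal` function and its labels (`hold-open`, `fast-fallback`, `default`) are hypothetical names for this sketch. It also simplifies the real flow, which additionally depends on whether Pipecat has already signalled end-of-turn.

```javascript
const policy = {
  shortUtteranceMaxChars: 12,
  shortUtteranceMaxWords: 2,
  lowConfidenceShortUtteranceThreshold: 0.75,
  continuationTokens: ["and", "but", "so", "well", "then", "uh", "um"],
};

// Accepts confidence as 0..1 or 0..100; a missing value is treated as 1.
function normalizeConfidence(value) {
  if (typeof value !== "number" || Number.isNaN(value)) return 1;
  return value > 1 ? value / 100 : value;
}

function isShortUtterance(text) {
  if (!text) return false;
  const words = text.trim().split(/\s+/).filter(Boolean);
  return (
    text.length <= policy.shortUtteranceMaxChars &&
    words.length <= policy.shortUtteranceMaxWords
  );
}

function startsWithContinuationToken(text) {
  if (!text) return false;
  const firstWord = text.trim().split(/\s+/)[0].toLowerCase();
  return policy.continuationTokens.includes(firstWord);
}

// Hypothetical classifier: labels are illustrative, not helper output.
function classifyFinal(text, confidence, hadCommittedPrefix) {
  if (!isShortUtterance(text)) return "default";
  const replaceable =
    hadCommittedPrefix ||
    normalizeConfidence(confidence) < policy.lowConfidenceShortUtteranceThreshold ||
    startsWithContinuationToken(text);
  return replaceable ? "hold-open" : "fast-fallback";
}

const examples = [
  classifyFinal("hello", 0.95, false),                 // short and confident
  classifyFinal("um", 0.9, false),                     // continuation cue
  classifyFinal("they", 40, false),                    // low confidence (40%)
  classifyFinal("AI", 0.9, true),                      // short tail after a prefix
  classifyFinal("do they support OpenAI", 0.9, false), // not short
];
```

Under these assumptions, a `hold-open` fragment gets the longer shortUtteranceExtensionMs window, a `fast-fallback` utterance uses the shorter of userSpeechTimeoutMs and fastShortUtteranceTimeoutMs, and everything else waits for the standard userSpeechTimeoutMs.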

Cleanup and shutdown

turnTaking.close() is the helper’s call-scoped shutdown path. It:

clears active fallback and settle timers
closes the Silero VAD instance
closes the Pipecat turn detector instance

VAD and turn detector connectors are live per-call resources, so close them to avoid accidental charges. The helper also keeps call-local timers and turn-state flags in memory; cleaning up prevents late timer callbacks and stale events from firing during termination.
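The helper source guards its timers with a version counter so that a callback scheduled for an earlier state does nothing if it fires late. The sketch below isolates that pattern. `createVersionedTimers` is a hypothetical name, and the injectable `schedule` and `cancel` parameters exist only so the behavior can be demonstrated without waiting on real timers.

```javascript
// Minimal version-stamped timer, modeled on the signalVersion check in the
// helper source. Names and structure here are assumptions for illustration.
function createVersionedTimers(schedule = setTimeout, cancel = clearTimeout) {
  let version = 0;
  let timer = null;
  return {
    start(delayMs, fn) {
      if (timer) cancel(timer);
      const stamped = version; // remember the state this timer belongs to
      timer = schedule(() => {
        timer = null;
        if (stamped !== version) return; // stale callback: ignore it
        fn();
      }, delayMs);
    },
    close() {
      if (timer) cancel(timer);
      timer = null;
      version += 1; // anything still queued is now stale
    },
  };
}

// Demonstration with a fake scheduler whose cancel does nothing, which
// simulates a callback that was already queued when close() ran.
const pending = [];
const fakeSchedule = (fn) => {
  pending.push(fn);
  return pending.length;
};
const fakeCancel = () => {};

const timers = createVersionedTimers(fakeSchedule, fakeCancel);
let fired = 0;
timers.start(1000, () => {
  fired += 1;
});
timers.close();
pending.forEach((fn) => fn()); // the late callback runs but is ignored
```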

How to use it

1. Include the helper before your scenario

VoxTurnTaking is implemented as a sequenced VoxEngine scenario, not a JavaScript module. Scenarios in the same routing rule sequence share one global scope, so the VoxTurnTaking object defined by the helper is available to any scenario that runs after it.

Include vox-turn-taking before your main scenario in the same routing rule sequence.

For the routing-rule setup flow in the Control Panel, see Setup routing.

2. Create STT, LLM, and TTS in the consuming scenario

The helper owns Silero and Pipecat. Your main scenario still owns:

STT
LLM / Responses client
TTS
agent prompts
what happens when barge-in interrupts playback

This separation is useful when you are mixing providers in a full-cascade pipeline, for example Deepgram + Groq via the OpenAI Responses client + Inworld.

3. Create the helper runtime

turnTaking = await VoxTurnTaking.create({
  call,
  stt,
  vadOptions: {
    threshold: 0.5,
    minSilenceDurationMs: 300,
    speechPadMs: 10,
  },
  turnDetectorOptions: {
    threshold: 0.5,
  },
  policy: {
    transcriptSettleMs: 500,
    userSpeechTimeoutMs: 1000,
    shortUtteranceExtensionMs: 1800,
    fastShortUtteranceTimeoutMs: 700,
    shortUtteranceMaxChars: 12,
    shortUtteranceMaxWords: 2,
    lowConfidenceShortUtteranceThreshold: 0.75,
    continuationTokens: ["and", "but", "so", "well", "then", "uh", "um"],
  },
  enableLogging: true,
  onUserTurn: (input, reason) => {
    responsesClient.createResponses({
      model: "llama-3.3-70b-versatile",
      instructions: SYSTEM_PROMPT,
      input,
    });
  },
  onInterrupt: () => {
    ttsPlayer.clearBuffer();
  },
});

4. Gate agent audio through the helper

responsesClient.addEventListener(
  OpenAI.ResponsesAPIEvents.ResponseTextDelta,
  (event) => {
    const text = event?.data?.payload?.delta;
    if (!text || !turnTaking.canPlayAgentAudio()) return;
    ttsPlayer.send({ send_text: { text } });
  },
);

5. Clean up on disconnect

call.addEventListener(CallEvents.Disconnected, () => {
  turnTaking?.close();
  VoxEngine.terminate();
});

Public API

Required options

Option	Type	What it does
`call`	`Call`	Call whose inbound media is analyzed by VAD and turn detection.
`stt`	`ASR`	STT instance created by the consuming scenario. The helper listens for interim and final transcripts.
`onUserTurn`	`function`	Called when the helper decides the current user turn is ready to submit to the LLM.

Optional options

Option	Type	Default	What it does
`onInterrupt`	`function`	none	Called when a new `speechStartAt` indicates barge-in. Use it to stop queued agent audio.
`enableLogging`	`boolean`	`false`	Emits debug logs for turn-taking decisions.
`logger`	`function`	`Logger.write`	Custom logger used when `enableLogging` is enabled.
`vadOptions`	`object`	helper defaults	Passed to `Silero.createVAD()` after merging with defaults.
`turnDetectorOptions`	`object`	helper defaults	Passed to `Pipecat.createTurnDetector()` after merging with defaults.
`policy`	`object`	helper defaults	Local policy layered on top of Voximplant’s current Silero and Pipecat surface.

Helper parameters

VAD options

These are forwarded to Silero.createVAD().

Option	Default	What it does
`threshold`	`0.5`	Speech probability threshold for VAD.
`minSilenceDurationMs`	`300`	Silence duration required before `speechEndAt`.
`speechPadMs`	`10` (helper default; the raw module default is `0`)	Padding around detected speech segments.

Turn detector options

These are forwarded to Pipecat.createTurnDetector().

Option	Default	What it does
`threshold`	`0.5`	End-of-turn probability threshold used by Pipecat.

Helper policy options

These are local helper controls, not native Silero or Pipecat parameters.

Option	Default	What it does
`transcriptSettleMs`	`500`	Short grace window after Pipecat says the turn may be complete but the final STT chunk has not landed yet.
`userSpeechTimeoutMs`	`1000`	Default fallback timeout after `speechEndAt`.
`shortUtteranceExtensionMs`	`1800`	Longer hold for short fragments that may be followed by more speech.
`fastShortUtteranceTimeoutMs`	`700`	Faster fallback for very short utterances that are likely complete, such as a standalone greeting.
`shortUtteranceMaxChars`	`12`	Maximum length still treated as a short fragment.
`shortUtteranceMaxWords`	`2`	Maximum word count still treated as a short fragment.
`lowConfidenceShortUtteranceThreshold`	`0.75`	Below this confidence, a short final transcript stays replaceable instead of being committed immediately.
`continuationTokens`	`and`, `but`, `so`, `well`, `then`, `uh`, `um`	Short leading words that often indicate the caller is continuing a thought rather than finishing a turn.
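The helper source merges `options.policy` over its defaults with `Object.assign`, which is a shallow merge. The sketch below shows what that implies in practice: scalar options you omit keep their defaults, while an array such as `continuationTokens` is replaced as a whole rather than extended. `DEFAULT_POLICY` here is a local copy of the defaults from this page, and the Spanish tokens and `like` are made-up example values.

```javascript
// Local copy of the helper's default policy, for illustration.
const DEFAULT_POLICY = {
  transcriptSettleMs: 500,
  userSpeechTimeoutMs: 1000,
  shortUtteranceExtensionMs: 1800,
  fastShortUtteranceTimeoutMs: 700,
  shortUtteranceMaxChars: 12,
  shortUtteranceMaxWords: 2,
  lowConfidenceShortUtteranceThreshold: 0.75,
  continuationTokens: ["and", "but", "so", "well", "then", "uh", "um"],
};

// Same merge shape the helper uses: defaults first, overrides last.
const mergePolicy = (overrides) => Object.assign({}, DEFAULT_POLICY, overrides);

// Partial override: unspecified options keep their defaults, but the
// token list is replaced entirely by the two example tokens.
const replaced = mergePolicy({
  userSpeechTimeoutMs: 1500,
  continuationTokens: ["y", "pero"],
});

// To extend the list instead, spread the defaults into the override.
const extended = mergePolicy({
  continuationTokens: [...DEFAULT_POLICY.continuationTokens, "like"],
});
```

If you only want to add a language-specific cue, the second form is usually the safer choice, since the first silently drops the built-in English tokens.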

Return value

VoxTurnTaking.create() returns an object with:

Field	Type	What it does
`vad`	`Silero.VAD`	The Silero VAD instance created by the helper.
`turnDetector`	`Pipecat.TurnDetector`	The Pipecat detector instance created by the helper.
`canPlayAgentAudio()`	`function`	Returns whether agent audio should still be forwarded to TTS.
`close()`	`function`	Clears timers and closes the helper-owned VAD and turn detector.

Turn Taking Helper Code

voxengine-vox-turn-taking.js

/**
 * Voximplant turn-taking runtime for sequenced scenarios.
 *
 * Include this scenario BEFORE any scenario that wants to use VoxTurnTaking
 * in the same routing rule sequence.
 *
 * This runtime hides the current Silero + Pipecat + timer-based turn policy
 * behind a small API so scenarios stay simple today and can transition more
 * easily if Voximplant later exposes a more Pipecat-native Smart Turn model.
 */

require(Modules.ASR);
require(Modules.Silero);
require(Modules.Pipecat);

// eslint-disable-next-line no-unused-vars
const VoxTurnTaking = {
    DEFAULTS: {
        vadOptions: {
            threshold: 0.5,
            minSilenceDurationMs: 300,
            speechPadMs: 10,
        },
        turnDetectorOptions: {
            threshold: 0.5,
        },
        policy: {
            transcriptSettleMs: 500,
            userSpeechTimeoutMs: 1000,
            shortUtteranceExtensionMs: 1800,
            fastShortUtteranceTimeoutMs: 700,
            shortUtteranceMaxChars: 12,
            shortUtteranceMaxWords: 2,
            lowConfidenceShortUtteranceThreshold: 0.75,
            continuationTokens: ["and", "but", "so", "well", "then", "uh", "um"],
        },
    },

    /**
     * Creates a turn-taking controller around a call, STT engine, Silero VAD,
     * and Pipecat turn detector.
     *
     * A user turn stays open until this runtime calls `onUserTurn()`. Silero,
     * Pipecat, and the timeout policy only provide evidence that the current
     * turn may be ready to submit.
     *
     * @param {object} options
     * @param {Call} options.call
     *   Active VoxEngine call whose inbound media should be analyzed.
     * @param {ASR} options.stt
     *   Speech-to-text engine already configured by the consuming scenario.
     * @param {(input: string, reason: string) => void} options.onUserTurn
     *   Callback invoked when the accumulated user turn should be submitted to
     *   the LLM.
     * @param {() => void} [options.onInterrupt]
     *   Callback invoked on barge-in so the consuming scenario can stop agent
     *   playback and flush TTS state.
     * @param {boolean} [options.enableLogging=false]
     *   When true, emits debug logs for turn-taking decisions. Disabled by
     *   default so scenarios can keep logs quiet unless they are debugging.
     * @param {(line: string) => void} [options.logger]
     *   Optional logger used when `enableLogging` is true.
     * @param {object} [options.vadOptions]
     *   Silero VAD options merged over `VoxTurnTaking.DEFAULTS.vadOptions`.
     * @param {number} [options.vadOptions.threshold]
     *   Voice activity threshold passed to `Silero.createVAD()`.
     * @param {number} [options.vadOptions.minSilenceDurationMs]
     *   Silence required before Silero emits `speechEndAt`.
     * @param {number} [options.vadOptions.speechPadMs]
     *   Padding used around detected speech segments.
     * @param {object} [options.turnDetectorOptions]
     *   Pipecat options merged over
     *   `VoxTurnTaking.DEFAULTS.turnDetectorOptions`.
     * @param {number} [options.turnDetectorOptions.threshold]
     *   End-of-turn probability threshold passed to
     *   `Pipecat.createTurnDetector()`.
     * @param {object} [options.policy]
     *   Local policy layered on top of Silero and Pipecat to bridge gaps in
     *   the current API.
     * @param {number} [options.policy.transcriptSettleMs]
     *   Extra ASR grace period after Pipecat signals end-of-turn but a final
     *   transcript chunk has not arrived yet.
     * @param {number} [options.policy.userSpeechTimeoutMs]
     *   Default fallback timeout started after `speechEndAt`.
     * @param {number} [options.policy.shortUtteranceExtensionMs]
     *   Longer hold time used for short fragments that may be followed by a
     *   continuation.
     * @param {number} [options.policy.fastShortUtteranceTimeoutMs]
     *   Shorter fallback used for brief, high-confidence utterances that are
     *   likely complete, such as a standalone greeting.
     * @param {number} [options.policy.shortUtteranceMaxChars]
     *   Maximum character count considered a short fragment.
     * @param {number} [options.policy.shortUtteranceMaxWords]
     *   Maximum word count considered a short fragment.
     * @param {number} [options.policy.lowConfidenceShortUtteranceThreshold]
     *   Confidence threshold below which a short final transcript stays
     *   replaceable instead of being committed immediately.
     * @param {string[]} [options.policy.continuationTokens]
     *   Short leading words that usually indicate the caller is continuing a
     *   thought rather than finishing a turn.
     * @returns {Promise<object>}
     * @returns {object} return.vad
     *   Silero VAD instance created by the runtime.
     * @returns {object} return.turnDetector
     *   Pipecat turn detector instance created by the runtime.
     * @returns {() => boolean} return.canPlayAgentAudio
     *   Indicates whether agent audio should still be forwarded to TTS.
     * @returns {() => void} return.close
     *   Cleans up timers and closes the VAD and turn detector.
     */
    async create(options) {
        const {
            call,
            stt,
            onUserTurn,
            onInterrupt,
            enableLogging = false,
            logger = (line) => Logger.write(line),
        } = options;
        const vadOptions = Object.assign({}, this.DEFAULTS.vadOptions, options.vadOptions);
        const turnDetectorOptions = Object.assign(
            {},
            this.DEFAULTS.turnDetectorOptions,
            options.turnDetectorOptions
        );
        const policy = Object.assign({}, this.DEFAULTS.policy, options.policy);

        const vad = await Silero.createVAD(vadOptions);
        const turnDetector = await Pipecat.createTurnDetector(turnDetectorOptions);

        call.sendMediaTo(vad);
        call.sendMediaTo(turnDetector);

        const log = (line) => {
            if (enableLogging) logger(line);
        };
        const emitModuleEvent = (eventName, event) => {
            logger(`===${eventName}===`);
            if (event) logger(JSON.stringify(event));
        };

        let fallbackTimer;
        let settleTimer;
        let finalTranscript = "";
        let interimTranscript = "";
        let transcriptSeparator = "";
        let smartTurnComplete = false;
        let acceptingTranscript = false;
        let signalVersion = 0;
        let allowAgentAudio = true;
        let lastFinalConfidence = 1;
        let replaceableShortFinal = false;
        let shortExtensionApplied = false;

        const clearTimers = () => {
            if (fallbackTimer) clearTimeout(fallbackTimer);
            if (settleTimer) clearTimeout(settleTimer);
            fallbackTimer = null;
            settleTimer = null;
        };

        const normalizeConfidence = (value) => {
            if (typeof value !== "number" || Number.isNaN(value)) return null;
            return value > 1 ? value / 100 : value;
        };

        const isShortUtterance = (text) => {
            if (!text) return false;
            const words = text.trim().split(/\s+/).filter(Boolean);
            return (
                text.length <= policy.shortUtteranceMaxChars &&
                words.length <= policy.shortUtteranceMaxWords
            );
        };

        const startsWithContinuationToken = (text) => {
            if (!text) return false;
            const firstWord = text.trim().split(/\s+/)[0]?.toLowerCase();
            return policy.continuationTokens.includes(firstWord);
        };

        const buildInput = () => {
            let input = finalTranscript;
            if (interimTranscript) {
                if (input) input += transcriptSeparator;
                input += interimTranscript;
            }
            return input.trim();
        };

        const submitCurrentTurn = (reason) => {
            const input = buildInput();
            if (!input) return false;

            // Hold short replaceable fragments open for one extra window so
            // resumed speech can overwrite them. After that extension, submit
            // the turn instead of looping forever.
            if (replaceableShortFinal && !shortExtensionApplied) {
                shortExtensionApplied = true;
                startHardTimeout(signalVersion, policy.shortUtteranceExtensionMs);
                return false;
            }

            log(`===${reason}===`);
            log(`===USER=== ${input}`);
            allowAgentAudio = true;
            onUserTurn(input, reason);
            finalTranscript = "";
            interimTranscript = "";
            transcriptSeparator = "";
            smartTurnComplete = false;
            acceptingTranscript = false;
            lastFinalConfidence = 1;
            replaceableShortFinal = false;
            shortExtensionApplied = false;
            signalVersion += 1;
            clearTimers();
            return true;
        };

        const startHardTimeout = (version, delay = policy.userSpeechTimeoutMs) => {
            clearTimers();
            fallbackTimer = setTimeout(() => {
                if (version !== signalVersion) return;

                const input = buildInput();
                if (!input) return;

                submitCurrentTurn("FALLBACK_END_OF_TURN");
            }, delay);
        };

        // Connector information and error events are part of the module's core
        // contract, so log them here instead of making every consuming scenario
        // re-register the same listeners.
        [
            Silero.VADEvents.ConnectorInformation,
            Silero.VADEvents.Error,
        ].forEach((eventName) => {
            vad.addEventListener(eventName, (event) => emitModuleEvent(eventName, event));
        });

        [
            Pipecat.TurnEvents.ConnectorInformation,
            Pipecat.TurnEvents.Error,
        ].forEach((eventName) => {
            turnDetector.addEventListener(eventName, (event) =>
                emitModuleEvent(eventName, event)
            );
        });

        stt.addEventListener(ASREvents.InterimResult, (event) => {
            if (!acceptingTranscript) return;
            const text = event?.text?.trim();
            if (!text) return;

            if (!transcriptSeparator && finalTranscript) transcriptSeparator = " ";
            interimTranscript = text;
        });

        stt.addEventListener(ASREvents.Result, (event) => {
            if (!acceptingTranscript) return;
            const text = event?.text?.trim();
            if (!text) return;
            const confidence = normalizeConfidence(event?.confidence);
            const hadCommittedPrefix = !!finalTranscript;

            // A short low-confidence fragment like "they" or "so" is often an
            // early clipped piece of a longer utterance. Keep it replaceable so
            // the next final STT chunk can overwrite it. Also keep short
            // trailing chunks replaceable when they arrive after an existing
            // transcript prefix, which helps prevent submits like
            // "do they support open" before the final "AI" lands.
            if (replaceableShortFinal) {
                finalTranscript = text;
            } else {
                if (finalTranscript) finalTranscript += transcriptSeparator || " ";
                finalTranscript += text;
            }

            interimTranscript = "";
            transcriptSeparator = " ";
            lastFinalConfidence = confidence === null ? 1 : confidence;
            replaceableShortFinal =
                isShortUtterance(text) &&
                (
                    hadCommittedPrefix ||
                    lastFinalConfidence < policy.lowConfidenceShortUtteranceThreshold ||
                    startsWithContinuationToken(text)
                );
            shortExtensionApplied = false;

            log(`===STT Final: ${event.text}`);
            if (isShortUtterance(text) && !replaceableShortFinal && !smartTurnComplete) {
                startHardTimeout(
                    signalVersion,
                    Math.min(
                        policy.userSpeechTimeoutMs,
                        policy.fastShortUtteranceTimeoutMs
                    )
                );
            }
            if (smartTurnComplete) submitCurrentTurn("TURN_DETECT: FINAL_TRANSCRIPT");
        });

        vad.addEventListener(Silero.VADEvents.Result, (event) => {
            if (event.speechStartAt) {
                signalVersion += 1;
                clearTimers();
                smartTurnComplete = false;
                acceptingTranscript = true;
                allowAgentAudio = false;
                if (finalTranscript || interimTranscript) transcriptSeparator = " ... ";
                log("===BARGE-IN===");
                if (onInterrupt) onInterrupt();
            }

            if (event.speechEndAt) {
                startHardTimeout(signalVersion);
                turnDetector.predict();
            }
        });

        turnDetector.addEventListener(Pipecat.TurnEvents.Result, (event) => {
            log(
                `===Pipecat.TurnEvents.Result=== ${JSON.stringify(event.probability)}`
            );
            if (!event.endOfTurn) return;

            smartTurnComplete = true;
            if (finalTranscript) {
                submitCurrentTurn("TURN_DETECT: END_OF_TURN");
                return;
            }

            if (settleTimer) clearTimeout(settleTimer);
            const version = signalVersion;
            settleTimer = setTimeout(() => {
                if (version !== signalVersion) return;
                submitCurrentTurn("TURN_DETECT: ASR_GRACE");
            }, policy.transcriptSettleMs);
        });

        return {
            vad,
            turnDetector,
            canPlayAgentAudio() {
                return allowAgentAudio;
            },
            close() {
                clearTimers();
                vad?.close();
                turnDetector?.close();
            },
        };
    },
};

The current full-cascade example that consumes this helper is:

Example: Full-cascade incl. Groq