Example: Answering an incoming call

This example answers an inbound Voximplant call and bridges its audio to the Inworld Realtime API for a live speech-to-speech demo. The scenario configures Inworld STT, Claude Sonnet as the language model, expressive Inworld TTS2 voice output, automatic turn responses, backchannels, responsiveness behavior, and barge-in.

Video link: Inworld Realtime API and Voximplant inbound SIP demo

Jump to the Full VoxEngine scenario.

Prerequisites

Set up an inbound entrypoint for the caller:
Create a routing rule that points the destination to this scenario.
Store your Inworld API key in Voximplant Secrets under INWORLD_API_KEY.

Session setup

Create an Inworld.RealtimeAPIClient by calling Inworld.createRealtimeAPIClient with your API key from Voximplant Secrets and a unique session key. The session key identifies the Inworld session; the example generates a new string for each call.

Create Inworld client

voiceAIClient = await Inworld.createRealtimeAPIClient({
  apiKey: VoxEngine.getSecretValue("INWORLD_API_KEY"),
  sessionKey: `inworld-realtime-demo-${Date.now()}`,
  onWebSocketClose: terminate,
});
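
A timestamp alone can repeat if two calls start in the same millisecond. As a minimal sketch, a helper like the one below adds a short random suffix. The name makeSessionKey is hypothetical and not part of the Inworld module, and the sketch assumes only what the notes state: that sessionKey can be any unique string.

```javascript
// Hypothetical helper (not part of the Inworld module): builds a session key
// from a prefix, a timestamp, and a short random suffix.
// `now` and `random` are injectable so the function is easy to test.
function makeSessionKey(prefix = "inworld-realtime-demo", now = Date.now(), random = Math.random) {
  // Base-36 keeps the suffix compact; padEnd covers short random strings.
  const suffix = random().toString(36).slice(2, 10).padEnd(8, "0");
  return `${prefix}-${now}-${suffix}`;
}
```

In the scenario, the result could be passed as sessionKey: makeSessionKey() in place of the template string.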

After Inworld emits SessionCreated, send a session.update payload that configures the model, prompt, output modalities, speech recognition, turn detection, voice output, and Inworld-specific provider features. The Inworld-specific settings are passed in the providerData block inside the same session config:

Session setup

voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.SessionCreated, () => {
  voiceAIClient.sessionUpdate({
    session: {
      type: "realtime",
      model: "claude-sonnet-4-6",
      instructions: SYSTEM_PROMPT,
      output_modalities: ["audio", "text"],
      audio: {
        input: {
          transcription: {
            model: "inworld/inworld-stt-1",
            language: "en",
            prompt: "Important terms: Voximplant, VoxEngine, Inworld, Inworld Realtime, TTS2, SIP, WhatsApp Business Calling.",
          },
          turn_detection: {
            type: "semantic_vad",
            eagerness: "high",
            create_response: true,
            interrupt_response: true,
          },
        },
        output: { voice: "Ashley", model: "inworld-tts-2", speed: 1.0 },
      },
      providerData: {
        tts: {
          delivery_mode: "BALANCED",
          segmenter_strategy: "full_turn",
          steering_handling: "emit_once",
        },
        stt: { voice_profile: true },
        backchannel: {
          enabled: true,
          eval_interval_ms: 800,
          min_speech_ms: 800,
          min_gap_ms: 4000,
          max_per_turn: 3,
          volume_gain: 0.6,
        },
        responsiveness: {
          enabled: true,
          initial_wait_timeout_ms: 500,
          hard_deadline_ms: 1200,
          max_tokens: 8,
          min_filler_gap_ms: 5000,
          max_initial_per_turn: 1,
          enable_filler_on_first_assistant_reply: false,
        },
      },
    },
  });
});

In this config, the standard Realtime API fields set the language model, the STT model, turn detection, and the TTS voice and model. The nested providerData block adds Inworld-specific behavior: delivery_mode: "BALANCED" for natural call-center speech, segmenter_strategy: "full_turn" for coherent spoken turns, backchannels for brief active-listening cues, and responsiveness settings to reduce dead air when the model needs more time.

For deeper tuning details, see Inworld’s docs for Realtime API Extensions, Adding Naturalness, Back-channel responses, and Responsiveness.
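
If you want to vary a few of these settings per call without copying the whole config, one option is a small merge helper. The sketch below is plain JavaScript. The name withOverrides is hypothetical, and the trimmed-down config only reuses field names from the session above for illustration.

```javascript
// Hypothetical helper: returns a new object with `overrides` merged into `base`.
// Plain objects merge recursively; arrays and primitives are replaced wholesale.
// Branches that are not overridden are shared by reference with `base`.
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

function withOverrides(base, overrides) {
  const result = { ...base };
  for (const [key, value] of Object.entries(overrides)) {
    result[key] = isPlainObject(value) && isPlainObject(base[key])
      ? withOverrides(base[key], value)
      : value;
  }
  return result;
}

// Trimmed-down config using field names from the session setup above.
const baseConfig = {
  session: {
    audio: {
      input: { turn_detection: { type: "semantic_vad", eagerness: "high" } },
      output: { voice: "Ashley", model: "inworld-tts-2", speed: 1.0 },
    },
    providerData: { backchannel: { enabled: true, max_per_turn: 3 } },
  },
};

// Same config, but with backchannels switched off for this call.
const quietConfig = withOverrides(baseConfig, {
  session: { providerData: { backchannel: { enabled: false } } },
});
```

The merged object could then be passed to voiceAIClient.sessionUpdate in place of a hand-edited copy.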

Connect call audio and greet

Once Inworld emits SessionUpdated, bridge audio both ways between the call and Inworld. Then create a short text conversation item and call responseCreate. This explicit seed item gives Inworld a normal user turn to answer, so the agent greets automatically before the caller speaks.

Connect audio and trigger greeting

VoxEngine.sendMediaBetween(call, voiceAIClient);
voiceAIClient.conversationItemCreate({
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_text",
        text: "The phone call just connected. Say only: Hi, this is Voxi from Voximplant. How can I help?",
      },
    ],
  },
});
voiceAIClient.responseCreate({
  response: {
    output_modalities: ["audio", "text"],
  },
});
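
If you seed more than one text turn, a small builder keeps the payload shape in one place. The sketch below is plain JavaScript and the name buildUserTextItem is hypothetical. It also collapses whitespace, which matters when the prompt comes from an indented multi-line template literal.

```javascript
// Hypothetical helper: wraps a text prompt in the conversation item shape
// used by the greeting above (type "message", role "user", input_text content).
function buildUserTextItem(text) {
  // Collapse newlines and runs of spaces left over from template literals.
  const normalized = String(text).replace(/\s+/g, " ").trim();
  if (!normalized) {
    throw new Error("Seed prompt must not be empty");
  }
  return {
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text: normalized }],
    },
  };
}
```

In the scenario, the returned object could be passed directly to voiceAIClient.conversationItemCreate before calling responseCreate.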

Barge-in

When the caller starts speaking, clear any queued Inworld output audio so the caller can interrupt the agent naturally.

Barge-in

voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStarted, () => {
  voiceAIClient.outputAudioBufferClear({});
});
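
The handler above clears the buffer on every speech start, which is simple and safe. If you also want to log how often the caller actually interrupts the agent, a small state tracker is one option. The sketch below is plain JavaScript and createBargeInTracker is a hypothetical name. It assumes that the OutputAudioBufferStarted and OutputAudioBufferStopped events listed in the full scenario fire at the start and end of agent playback; that timing is an assumption, not something this page confirms.

```javascript
// Hypothetical tracker: remembers whether the agent is speaking and counts
// how many times the caller started talking over it.
function createBargeInTracker() {
  let agentSpeaking = false;
  let interruptions = 0;
  return {
    // Call from an OutputAudioBufferStarted listener (assumed: playback began).
    onOutputStarted() {
      agentSpeaking = true;
    },
    // Call from an OutputAudioBufferStopped listener (assumed: playback ended).
    onOutputStopped() {
      agentSpeaking = false;
    },
    // Call from the InputAudioBufferSpeechStarted listener.
    // Returns true only when the caller interrupted ongoing agent speech.
    onCallerSpeechStarted() {
      if (!agentSpeaking) return false;
      agentSpeaking = false;
      interruptions += 1;
      return true;
    },
    get interruptions() {
      return interruptions;
    },
  };
}
```

The return value of onCallerSpeechStarted could decide whether to write a barge-in log line, while outputAudioBufferClear is still called unconditionally as shown above.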

Notes

INWORLD_API_KEY is read from Voximplant Secrets.
sessionKey can be any unique string; it identifies the Inworld session so that its context is maintained.
create_response: true lets Inworld respond automatically after caller turns.
The initial greeting is triggered manually with conversationItemCreate followed by responseCreate.
See the Inworld VoxEngine API reference for more details.

Full VoxEngine scenario

voxengine-inworld-answer-incoming-call.js

/**
 * Voximplant + Inworld Realtime API demo
 * Scenario: answer an incoming call and bridge it to Inworld Realtime.
 */

require(Modules.Inworld);

const SYSTEM_PROMPT = `
You are Voxi, a Voximplant developer advocate on a live phone call.
Voximplant is pronounced VOX-im-plant.
Keep answers short, natural, and useful for a marketing demo.
Default to one short sentence under 12 words.

If the caller asks "Tell me about Voximplant and Inworld" or any similar broad
question, answer directly: Voximplant brings Inworld Realtime agents to real
communication channels, while Inworld provides expressive realtime voice,
conversation-aware delivery, TTS2, persona control, and voice direction.

Demo goal:
- Explain how Voximplant brings Inworld Realtime voice agents to real communications channels.
- Highlight Inworld expressive realtime speech, TTS2-style delivery, conversation awareness, voice direction, and persona control.
- Highlight Voximplant calling, SIP, WhatsApp Business Calling, browser calls, native app calls, and VoxEngine orchestration.
- Mention that teams can use Inworld without building a custom media gateway.
- Mention production telephony features when relevant: transfers, DTMF, barge-in, no-audio detection, debugging, and monitoring.

Voice style:
- Sound like an expressive product expert, not a flat IVR.
- Adapt your tone to the caller's question.
- Use short, human turns. Default to one short spoken sentence.
- Use small spoken disfluencies sparingly: "uh", "hmm", "well", "right", "okay".
- Use at most one TTS-2 non-verbal tag per turn, and often none: [laugh], [breathe], [sigh], [clear throat].
- Use at most one [speak ...] steering tag per turn. If used, it must be first.
- If the caller asks about emotion, persona, pacing, or voice control, demonstrate it briefly with a [speak ...] tag or a natural non-verbal cue.
- If the caller asks what this demo is, say this is Voximplant connected to Inworld Realtime on a live phone call.

If the caller asks about a product detail you are not sure of, say you would check the latest docs instead of guessing.
`;

const SESSION_CONFIG = {
    session: {
        type: "realtime",   // Realtime mode keeps the call audio stream connected to Inworld.
        model: "claude-sonnet-4-6",     // Note: some models require a payment method on file
        instructions: SYSTEM_PROMPT,
        output_modalities: ["audio", "text"],   // Audio is required for automatic turn-detection responses
        audio: {
            input: {
                // Inworld STT keeps the transcript inside the same realtime session.
                transcription: {
                    model: "inworld/inworld-stt-1",
                    language: "en",
                    prompt: "Important terms: Voximplant, VoxEngine, Inworld, Inworld Realtime, TTS2, SIP, WhatsApp Business Calling.",
                },
                // Semantic VAD decides when the caller's turn is complete and can auto-create the response.
                turn_detection: {
                    type: "semantic_vad",
                    eagerness: "high",
                    create_response: true,
                    interrupt_response: true,
                },
            },
            output: {
                // TTS-2 voice output. This is where the demo showcases expressive delivery.
                voice: "Ashley",
                model: "inworld-tts-2",
                speed: 1.0,
            },
        },
        providerData: {
            tts: {
                delivery_mode: "BALANCED", // Naturalness: steady call-center prosody with TTS-2 control tags.
                segmenter_strategy: "full_turn", // Naturalness: wait for the full turn before speech output.
                steering_handling: "emit_once",  // Apply [speak ...] steering once per response
            },
            stt: {
                // Helps Inworld adapt recognition to the current speaker during the session.
                voice_profile: true,
            },
            backchannel: {
                // Back-channel can produce short acknowledgements while the caller is still speaking.
                enabled: true,
                eval_interval_ms: 800,
                min_speech_ms: 800,
                min_gap_ms: 4000,
                max_per_turn: 3,
                volume_gain: 0.6,
            },
            responsiveness: {
                // Responsiveness allows brief filler behavior if generation is slower than expected.
                enabled: true,
                initial_wait_timeout_ms: 500,
                hard_deadline_ms: 1200,
                max_tokens: 8,
                min_filler_gap_ms: 5000,
                max_initial_per_turn: 1,
                enable_filler_on_first_assistant_reply: false,
            },
        },
    },
};

VoxEngine.addEventListener(AppEvents.CallAlerting, async ({call}) => {
    let voiceAIClient;

    // Helper to clean up the call when done
    const terminate = (event) => {
        if (event) Logger.write(JSON.stringify(event));
        voiceAIClient?.close();
        VoxEngine.terminate();
    };

    // Termination handlers
    call.addEventListener(CallEvents.Disconnected, terminate);
    call.addEventListener(CallEvents.Failed, terminate);

    try {
        call.answer();
        // Optional: record and transcribe the call. Note: `require(Modules.ASR)` is needed for the enums used below.
        // call.record({hd_audio: true, stereo: true, transcribe: true, provider: TranscriptionProvider.GOOGLE, language: ASRLanguage.ENGLISH_US });

        // Create client and connect to Inworld Realtime.
        voiceAIClient = await Inworld.createRealtimeAPIClient({
            apiKey: VoxEngine.getSecretValue("INWORLD_API_KEY"),
            sessionKey: `inworld-realtime-demo-${Date.now()}`,
            onWebSocketClose: terminate,
        });

        voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.SessionCreated, () => {
            Logger.write("===Inworld.SessionCreated===");
            voiceAIClient.sessionUpdate(SESSION_CONFIG);
        });

        // Once the session is configured, bridge call media and trigger the greeting.
        voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.SessionUpdated, () => {
            Logger.write("===Inworld.SessionUpdated===");
            VoxEngine.sendMediaBetween(call, voiceAIClient);
            voiceAIClient.conversationItemCreate({
                item: {
                    type: "message",
                    role: "user",
                    content: [
                        {
                            type: "input_text",
                            text: `The phone call just connected.
                                   Greet the caller as Voxi from Voximplant and invite a question about Voximplant and Inworld.`,
                        },
                    ],
                },
            });
            voiceAIClient.responseCreate({
                response: {
                    output_modalities: ["audio", "text"],
                },
            });
        });

        // Barge-in: clear buffered output audio when the caller starts speaking.
        voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStarted, () => {
            Logger.write("===BARGE-IN: Inworld.InputAudioBufferSpeechStarted===");
            voiceAIClient.outputAudioBufferClear({});
        });

        // Caller transcript for debugging.
        voiceAIClient.addEventListener(
            Inworld.RealtimeAPIEvents.ConversationItemInputAudioTranscriptionCompleted,
            (event) => {
                const payload = event?.data?.payload || event?.data || {};
                const transcript = payload.transcript || payload.text || payload.delta;
                if (transcript) Logger.write(`===USER==> ${transcript}`);
            },
        );

        // Final agent transcript for readable call logs.
        voiceAIClient.addEventListener(
            Inworld.RealtimeAPIEvents.ResponseOutputAudioTranscriptDone,
            (event) => {
                const payload = event?.data?.payload || event?.data || {};
                const transcript = payload.transcript || payload.text;
                if (transcript) Logger.write(`===AGENT==> ${transcript}`);
            },
        );

        // Consolidated log-only handlers for lifecycle, audio, and error debugging.
        [
            Inworld.RealtimeAPIEvents.ConversationItemInputAudioTranscriptionDelta,
            Inworld.RealtimeAPIEvents.ResponseCreated,
            Inworld.RealtimeAPIEvents.ResponseDone,
            Inworld.RealtimeAPIEvents.ResponseOutputAudioDone,
            Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStopped,
            Inworld.RealtimeAPIEvents.InputAudioBufferCommitted,
            Inworld.RealtimeAPIEvents.InputAudioBufferCleared,
            Inworld.RealtimeAPIEvents.OutputAudioBufferStarted,
            Inworld.RealtimeAPIEvents.OutputAudioBufferStopped,
            Inworld.RealtimeAPIEvents.OutputAudioBufferCleared,
            Inworld.RealtimeAPIEvents.ConnectorInformation,
            // Inworld.RealtimeAPIEvents.ResponseOutputAudioTranscriptDelta, // this is noisy
            Inworld.RealtimeAPIEvents.HTTPResponse,
            Inworld.RealtimeAPIEvents.Error,
            Inworld.RealtimeAPIEvents.WebSocketError,
            Inworld.RealtimeAPIEvents.Unknown,
            Inworld.Events.WebSocketMediaStarted,
            Inworld.Events.WebSocketMediaEnded,
        ].forEach((eventName) => {
            voiceAIClient.addEventListener(eventName, (event) => {
                Logger.write(`===${event.name}===`);
                if (event?.data) Logger.write(JSON.stringify(event.data));
            });
        });

    } catch (error) {
        Logger.write("===UNHANDLED_ERROR===");
        terminate(error instanceof Error ? {message: error.message, stack: error.stack} : {error: String(error)});
    }
});
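
The two transcript handlers in the scenario repeat the same payload probing. As a sketch, that logic could be factored into one helper. The name extractTranscript is hypothetical, and the helper makes the same assumption the scenario does: the transcript may sit under event.data.payload or event.data, in a transcript, text, or (for caller audio) delta field.

```javascript
// Hypothetical helper: pulls a transcript string out of a realtime event,
// probing the same fields the scenario's handlers check.
// Returns null when no non-empty string is found.
function extractTranscript(event, { allowDelta = false } = {}) {
  const payload = event?.data?.payload || event?.data || {};
  const candidate =
    payload.transcript || payload.text || (allowDelta ? payload.delta : undefined);
  return typeof candidate === "string" && candidate.length > 0 ? candidate : null;
}
```

In the caller handler this would be called with { allowDelta: true }, and in the agent handler with the defaults, before writing the result with Logger.write.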