Example: Answering an incoming call

This example answers an inbound Voximplant call and bridges its audio to the Inworld Realtime API for a live speech-to-speech demo. The scenario configures Inworld STT, Claude Sonnet as the language model, expressive Inworld TTS-2 voice output, automatic turn responses, backchannels, responsiveness behavior, and barge-in.

Video link: Inbound SIP demo video coming soon.

Jump to the Full VoxEngine scenario.

Prerequisites

Session setup

Create an Inworld.RealtimeAPIClient by calling Inworld.createRealtimeAPIClient with your API key, read from VoxEngine Secrets, and a unique session key. The session key identifies the Inworld session; the example builds a new string for each call from the current timestamp.

Create Inworld client
voiceAIClient = await Inworld.createRealtimeAPIClient({
  apiKey: VoxEngine.getSecretValue("INWORLD_API_KEY"),
  sessionKey: `inworld-realtime-demo-${Date.now()}`,
  onWebSocketClose: terminate,
});
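
Because Date.now() has millisecond resolution, two calls that start in the same millisecond would get the same key. If that matters for your deployment, one option is to append a random suffix. The following is a minimal sketch only: makeSessionKey and its parameters are hypothetical and not part of the VoxEngine or Inworld API.

```javascript
// Hypothetical helper: not part of the VoxEngine or Inworld API.
// Builds a session key from a prefix, a timestamp, and a random suffix.
function makeSessionKey(prefix = "inworld-realtime-demo", now = Date.now(), rand = Math.random) {
  // Up to 8 base-36 characters of randomness, padded when the fraction is short.
  const suffix = rand().toString(36).slice(2, 10).padEnd(8, "0");
  return `${prefix}-${now}-${suffix}`;
}

// Example usage: the result would be passed as `sessionKey`.
console.log(makeSessionKey());
```

The timestamp and random source are parameters so the function can be exercised with fixed values in a test.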

After Inworld emits SessionCreated, send a session.update payload that configures the model, prompt, output modalities, speech recognition, turn detection, voice output, and Inworld-specific provider features. The Inworld-specific settings are passed in the providerData block inside the same session config:

Session setup
voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.SessionCreated, () => {
  voiceAIClient.sessionUpdate({
    session: {
      type: "realtime",
      model: "claude-sonnet-4-6",
      instructions: SYSTEM_PROMPT,
      output_modalities: ["audio", "text"],
      audio: {
        input: {
          transcription: {
            model: "inworld/inworld-stt-1",
            language: "en",
            prompt: "Important terms: Voximplant, VoxEngine, Inworld, Inworld Realtime, TTS2, SIP, WhatsApp Business Calling.",
          },
          turn_detection: {
            type: "semantic_vad",
            eagerness: "high",
            create_response: true,
            interrupt_response: true,
          },
        },
        output: { voice: "Ashley", model: "inworld-tts-2", speed: 1.0 },
      },
      providerData: {
        tts: {
          delivery_mode: "BALANCED",
          segmenter_strategy: "full_turn",
          steering_handling: "emit_once",
        },
        stt: { voice_profile: true },
        backchannel: {
          enabled: true,
          eval_interval_ms: 800,
          min_speech_ms: 800,
          min_gap_ms: 4000,
          max_per_turn: 3,
          volume_gain: 0.6,
        },
        responsiveness: {
          enabled: true,
          initial_wait_timeout_ms: 500,
          hard_deadline_ms: 1200,
          max_tokens: 8,
          min_filler_gap_ms: 5000,
          max_initial_per_turn: 1,
          enable_filler_on_first_assistant_reply: false,
        },
      },
    },
  });
});

In this config, standard Realtime API fields set the voice model, STT model, and turn detection. The nested providerData block adds Inworld-specific behavior: delivery_mode: "BALANCED" for natural call-center speech, segmenter_strategy: "full_turn" for coherent spoken turns, backchannels for brief active-listening cues, and responsiveness settings to reduce dead air when the model needs more time.
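
If several scenarios share most of this configuration, one way to vary a few fields per call is to deep-merge overrides into a base object before passing the result to sessionUpdate. This is a minimal sketch that assumes plain JSON-like config objects. mergeConfig is a hypothetical helper, and the abbreviated BASE_CONFIG only mirrors a few fields from the example for illustration.

```javascript
// Hypothetical helper: recursively merges `overrides` into `base`.
// Returns a new object and does not mutate `base`. Arrays and primitives are
// replaced; plain objects are merged key by key. Nested objects that are not
// overridden are shared by reference with `base`.
function mergeConfig(base, overrides) {
  const isPlainObject = (v) => v !== null && typeof v === "object" && !Array.isArray(v);
  const result = { ...base };
  for (const [key, value] of Object.entries(overrides)) {
    result[key] = isPlainObject(value) && isPlainObject(base[key])
      ? mergeConfig(base[key], value)
      : value;
  }
  return result;
}

// Abbreviated copy of the example config, for illustration only.
const BASE_CONFIG = {
  session: {
    audio: { output: { voice: "Ashley", model: "inworld-tts-2", speed: 1.0 } },
    providerData: { backchannel: { enabled: true, max_per_turn: 3 } },
  },
};

// Disable backchannels for one call without touching the other settings.
const quietConfig = mergeConfig(BASE_CONFIG, {
  session: { providerData: { backchannel: { enabled: false } } },
});
console.log(JSON.stringify(quietConfig.session.providerData.backchannel));
```

Keeping the base object immutable means one shared constant can serve every call while each call sends its own merged copy.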

For deeper tuning details, see Inworld’s docs for Realtime API Extensions, Adding Naturalness, Back-channel responses, and Responsiveness.

Connect call audio and greet

Once Inworld emits SessionUpdated, bridge audio both ways between the call and Inworld. Then create a short text conversation item and call responseCreate. This explicit seed item gives Inworld a normal user turn to answer, so the agent greets automatically before the caller speaks.

Connect audio and trigger greeting
VoxEngine.sendMediaBetween(call, voiceAIClient);
voiceAIClient.conversationItemCreate({
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_text",
        text: "The phone call just connected. Say only: Hi, this is Voxi from Voximplant. How can I help?",
      },
    ],
  },
});
voiceAIClient.responseCreate({
  response: {
    output_modalities: ["audio", "text"],
  },
});
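
To reuse the greeting step with different wording, the two payloads can be produced by a small function. This is a sketch only: buildGreeting is a hypothetical helper, and the payload shapes are copied from the snippet above rather than from a separate specification.

```javascript
// Hypothetical helper: returns the payloads for conversationItemCreate
// and responseCreate, given the seed text for the synthetic user turn.
function buildGreeting(text) {
  return {
    item: {
      item: {
        type: "message",
        role: "user",
        content: [{ type: "input_text", text }],
      },
    },
    response: {
      response: { output_modalities: ["audio", "text"] },
    },
  };
}

const greeting = buildGreeting(
  "The phone call just connected. Say only: Hi, this is Voxi from Voximplant. How can I help?"
);
// In the scenario these would be sent as:
//   voiceAIClient.conversationItemCreate(greeting.item);
//   voiceAIClient.responseCreate(greeting.response);
console.log(JSON.stringify(greeting.item));
```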

Barge-in

When the caller starts speaking, clear any queued Inworld output audio so the caller can interrupt the agent naturally.

Barge-in
voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStarted, () => {
  voiceAIClient.outputAudioBufferClear({});
});
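
The handler can be tried outside VoxEngine with a stand-in client that records calls. The sketch below is an illustration under assumptions: FakeClient, attachBargeIn, and the "speech_started" event name are hypothetical placeholders, not the real Inworld client or its event constants.

```javascript
// Stand-in for the real client, for local experimentation only (hypothetical).
class FakeClient {
  constructor() {
    this.listeners = {};
    this.clears = 0; // number of times queued output audio was cleared
  }
  addEventListener(name, handler) {
    if (!this.listeners[name]) this.listeners[name] = [];
    this.listeners[name].push(handler);
  }
  emit(name, event = {}) {
    (this.listeners[name] || []).forEach((handler) => handler(event));
  }
  outputAudioBufferClear() {
    this.clears += 1;
  }
}

// Same wiring as the snippet above, with the event name passed in.
function attachBargeIn(client, speechStartedEvent) {
  client.addEventListener(speechStartedEvent, () => client.outputAudioBufferClear({}));
}

const fake = new FakeClient();
// "speech_started" stands in for Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStarted.
attachBargeIn(fake, "speech_started");
fake.emit("speech_started");
fake.emit("speech_started");
console.log(fake.clears);
```

Each emitted speech-start event triggers exactly one clear, which is the behavior the scenario relies on for interruption.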

Notes

  • INWORLD_API_KEY is read from Voximplant Secrets.
  • sessionKey can be any unique string to maintain context for the Inworld session.
  • create_response: true lets Inworld respond automatically after caller turns.
  • The initial greeting is triggered manually with conversationItemCreate followed by responseCreate.
  • See the Inworld VoxEngine API reference for more details.

Full VoxEngine scenario

voxengine-inworld-answer-incoming-call.js
/**
 * Voximplant + Inworld Realtime API demo
 * Scenario: answer an incoming call and bridge it to Inworld Realtime.
 */

require(Modules.Inworld);

const SYSTEM_PROMPT = `
You are Voxi, a Voximplant developer advocate on a live phone call.
Voximplant is pronounced VOX-im-plant.
Keep answers short, natural, and useful for a marketing demo.
Default to one short sentence under 12 words.

If the caller asks "Tell me about Voximplant and Inworld" or any similar broad
question, answer directly: Voximplant brings Inworld Realtime agents to real
communication channels, while Inworld provides expressive realtime voice,
conversation-aware delivery, TTS2, persona control, and voice direction.

Demo goal:
- Explain how Voximplant brings Inworld Realtime voice agents to real communications channels.
- Highlight Inworld expressive realtime speech, TTS2-style delivery, conversation awareness, voice direction, and persona control.
- Highlight Voximplant calling, SIP, WhatsApp Business Calling, browser calls, native app calls, and VoxEngine orchestration.
- Mention that teams can use Inworld without building a custom media gateway.
- Mention production telephony features when relevant: transfers, DTMF, barge-in, no-audio detection, debugging, and monitoring.

Voice style:
- Sound like an expressive product expert, not a flat IVR.
- Adapt your tone to the caller's question.
- Use short, human turns. Default to one short spoken sentence.
- Use small spoken disfluencies sparingly: "uh", "hmm", "well", "right", "okay".
- Use at most one TTS-2 non-verbal tag per turn, and often none: [laugh], [breathe], [sigh], [clear throat].
- Use at most one [speak ...] steering tag per turn. If used, it must be first.
- If the caller asks about emotion, persona, pacing, or voice control, demonstrate it briefly with a [speak ...] tag or a natural non-verbal cue.
- If the caller asks what this demo is, say this is Voximplant connected to Inworld Realtime on a live phone call.

If the caller asks about a product detail you are not sure of, say you would check the latest docs instead of guessing.
`;

const SESSION_CONFIG = {
  session: {
    type: "realtime", // Realtime mode keeps the call audio stream connected to Inworld.
    model: "claude-sonnet-4-6", // Note: some models require a payment method on file
    instructions: SYSTEM_PROMPT,
    output_modalities: ["audio", "text"], // Audio is required for automatic turn-detection responses
    audio: {
      input: {
        // Inworld STT keeps the transcript inside the same realtime session.
        transcription: {
          model: "inworld/inworld-stt-1",
          language: "en",
          prompt: "Important terms: Voximplant, VoxEngine, Inworld, Inworld Realtime, TTS2, SIP, WhatsApp Business Calling.",
        },
        // Semantic VAD decides when the caller's turn is complete and can auto-create the response.
        turn_detection: {
          type: "semantic_vad",
          eagerness: "high",
          create_response: true,
          interrupt_response: true,
        },
      },
      output: {
        // TTS-2 voice output. This is where the demo showcases expressive delivery.
        voice: "Ashley",
        model: "inworld-tts-2",
        speed: 1.0,
      },
    },
    providerData: {
      tts: {
        delivery_mode: "BALANCED", // Naturalness: steady call-center prosody with TTS-2 control tags.
        segmenter_strategy: "full_turn", // Naturalness: wait for the full turn before speech output.
        steering_handling: "emit_once", // Apply [speak ...] steering once per response
      },
      stt: {
        // Helps Inworld adapt recognition to the current speaker during the session.
        voice_profile: true,
      },
      backchannel: {
        // Back-channel can produce short acknowledgements while the caller is still speaking.
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        volume_gain: 0.6,
      },
      responsiveness: {
        // Responsiveness allows brief filler behavior if generation is slower than expected.
        enabled: true,
        initial_wait_timeout_ms: 500,
        hard_deadline_ms: 1200,
        max_tokens: 8,
        min_filler_gap_ms: 5000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
      },
    },
  },
};

VoxEngine.addEventListener(AppEvents.CallAlerting, async ({call}) => {
  let voiceAIClient;

  // Helper to clean up the call when done
  const terminate = (event) => {
    if (event) Logger.write(JSON.stringify(event));
    voiceAIClient?.close();
    VoxEngine.terminate();
  };

  // Termination handlers
  call.addEventListener(CallEvents.Disconnected, terminate);
  call.addEventListener(CallEvents.Failed, terminate);

  try {
    call.answer();
    // Optional: record & transcribe the call. Note: `require(Modules.ASR)` is needed for the enums
    // call.record({hd_audio: true, stereo: true, transcribe: true, provider: TranscriptionProvider.GOOGLE, language: ASRLanguage.ENGLISH_US });

    // Create client and connect to Inworld Realtime.
    voiceAIClient = await Inworld.createRealtimeAPIClient({
      apiKey: VoxEngine.getSecretValue("INWORLD_API_KEY"),
      sessionKey: `inworld-realtime-demo-${Date.now()}`,
      onWebSocketClose: terminate,
    });

    voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.SessionCreated, () => {
      Logger.write("===Inworld.SessionCreated===");
      voiceAIClient.sessionUpdate(SESSION_CONFIG);
    });

    // Once the session is configured, bridge call media and trigger the greeting.
    voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.SessionUpdated, () => {
      Logger.write("===Inworld.SessionUpdated===");
      VoxEngine.sendMediaBetween(call, voiceAIClient);
      voiceAIClient.conversationItemCreate({
        item: {
          type: "message",
          role: "user",
          content: [
            {
              type: "input_text",
              text: `The phone call just connected.
              Greet the caller as Voxi from Voximplant and invite a question about Voximplant and Inworld.`,
            },
          ],
        },
      });
      voiceAIClient.responseCreate({
        response: {
          output_modalities: ["audio", "text"],
        },
      });
    });

    // Barge-in: clear buffered output audio when the caller starts speaking.
    voiceAIClient.addEventListener(Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStarted, () => {
      Logger.write("===BARGE-IN: Inworld.InputAudioBufferSpeechStarted===");
      voiceAIClient.outputAudioBufferClear({});
    });

    // Caller transcript for debugging.
    voiceAIClient.addEventListener(
      Inworld.RealtimeAPIEvents.ConversationItemInputAudioTranscriptionCompleted,
      (event) => {
        const payload = event?.data?.payload || event?.data || {};
        const transcript = payload.transcript || payload.text || payload.delta;
        if (transcript) Logger.write(`===USER==> ${transcript}`);
      },
    );

    // Final agent transcript for readable call logs.
    voiceAIClient.addEventListener(
      Inworld.RealtimeAPIEvents.ResponseOutputAudioTranscriptDone,
      (event) => {
        const payload = event?.data?.payload || event?.data || {};
        const transcript = payload.transcript || payload.text;
        if (transcript) Logger.write(`===AGENT==> ${transcript}`);
      },
    );

    // Consolidated log-only handlers for lifecycle, audio, and error debugging.
    [
      Inworld.RealtimeAPIEvents.ConversationItemInputAudioTranscriptionDelta,
      Inworld.RealtimeAPIEvents.ResponseCreated,
      Inworld.RealtimeAPIEvents.ResponseDone,
      Inworld.RealtimeAPIEvents.ResponseOutputAudioDone,
      Inworld.RealtimeAPIEvents.InputAudioBufferSpeechStopped,
      Inworld.RealtimeAPIEvents.InputAudioBufferCommitted,
      Inworld.RealtimeAPIEvents.InputAudioBufferCleared,
      Inworld.RealtimeAPIEvents.OutputAudioBufferStarted,
      Inworld.RealtimeAPIEvents.OutputAudioBufferStopped,
      Inworld.RealtimeAPIEvents.OutputAudioBufferCleared,
      Inworld.RealtimeAPIEvents.ConnectorInformation,
      // Inworld.RealtimeAPIEvents.ResponseOutputAudioTranscriptDelta, // this is noisy
      Inworld.RealtimeAPIEvents.HTTPResponse,
      Inworld.RealtimeAPIEvents.Error,
      Inworld.RealtimeAPIEvents.WebSocketError,
      Inworld.RealtimeAPIEvents.Unknown,
      Inworld.Events.WebSocketMediaStarted,
      Inworld.Events.WebSocketMediaEnded,
    ].forEach((eventName) => {
      voiceAIClient.addEventListener(eventName, (event) => {
        Logger.write(`===${event.name}===`);
        if (event?.data) Logger.write(JSON.stringify(event.data));
      });
    });

  } catch (error) {
    Logger.write("===UNHANDLED_ERROR===");
    terminate(error instanceof Error ? {message: error.message, stack: error.stack} : {error: String(error)});
  }
});
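
The two transcript handlers in the scenario unwrap the event payload in nearly the same way, so that logic could be factored into one function. This is a minimal sketch: extractTranscript is a hypothetical helper, and the sample event shapes are illustrative assumptions, not documented Inworld payloads. Unlike the agent handler in the scenario, this version also falls back to a delta field.

```javascript
// Hypothetical helper: pulls a transcript string out of an event, trying the
// same fields the scenario's handlers check. Returns null when none is present.
function extractTranscript(event) {
  const payload = event?.data?.payload || event?.data || {};
  return payload.transcript || payload.text || payload.delta || null;
}

// Sample events below are made up for illustration.
console.log(extractTranscript({ data: { payload: { transcript: "hello" } } })); // prints: hello
console.log(extractTranscript({ data: { text: "hi there" } }));                // prints: hi there
console.log(extractTranscript(undefined));                                      // prints: null
```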