Example: Speech-to-speech translation


This example answers an inbound English call, dials a Spanish-speaking callee, and uses Gemini Live API to translate the caller’s speech into Spanish audio in real time.

⬇️ Jump to the Full VoxEngine scenario.

Gemini 3.1 Flash Live Preview

This page reflects the current gemini-3.1-flash-live-preview flow from Google’s Live API docs: https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-live-preview

Prerequisites

Set up an inbound entrypoint for the caller:
- Phone number: https://voximplant.com/docs/getting-started/basic-concepts/phone-numbers
- WhatsApp: https://voximplant.com/docs/guides/integrations/whatsapp
- SIP user / SIP registration: https://voximplant.com/docs/guides/calls/sip
- App user: https://voximplant.com/docs/getting-started/basic-concepts/users (see also https://voximplant.com/docs/guides/calls/scenarios#how-to-call-a-voximplant-user)
Create a routing rule that points the destination (phone number / WhatsApp / SIP username / app user alias) to this scenario: https://voximplant.com/docs/getting-started/basic-concepts/routing-rules
Store the following values in Voximplant ApplicationStorage:
- GEMINI_API_KEY
- CALLEE_DESTINATION (Spanish-speaking callee, e.g. +34911222333)
- PSTN_CALLER_ID (verified caller ID / purchased Voximplant number)
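
If one of these values is missing, the scenario only fails once a call is already in progress. As a minimal sketch, the values could be loaded and validated in one place. `loadSettings` is a hypothetical helper, not part of VoxEngine. It assumes only that the storage object exposes an async `get(key)` that resolves to an object with a `value` field (as the full scenario uses it), or to a null-like value when the key is absent.

```javascript
// Hypothetical helper: load required keys from an ApplicationStorage-like object.
// Assumption: storage.get(key) resolves to { value } or to null/undefined for a missing key.
async function loadSettings(storage, keys) {
    const settings = {};
    const missing = [];
    for (const key of keys) {
        const entry = await storage.get(key);
        const value = entry?.value;
        if (typeof value === "string" && value.trim() !== "") {
            settings[key] = value.trim();
        } else {
            missing.push(key);
        }
    }
    // Report every missing key at once instead of failing on the first one.
    if (missing.length > 0) {
        throw new Error(`Missing ApplicationStorage keys: ${missing.join(", ")}`);
    }
    return settings;
}
```

Inside a scenario, this could be called as `await loadSettings(ApplicationStorage, ["GEMINI_API_KEY", "CALLEE_DESTINATION", "PSTN_CALLER_ID"])` within a try/catch that hangs up the call on failure.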

Demo video

Video link: Gemini Live speech-to-speech translation demo

Session setup

The Gemini Live API session is configured via connectConfig, passed into Gemini.createLiveAPIClient(...).

In the full scenario, see GEMINI_CONNECT_CONFIG:

responseModalities: ["AUDIO"] asks Gemini to speak back in real time.
thinkingConfig: { thinkingLevel: "minimal" } reduces latency.
realtimeInputConfig.automaticActivityDetection tunes barge-in behavior.
speechConfig selects a prebuilt voice for the translated audio.
systemInstruction enforces the English → Spanish translation behavior.
inputAudioTranscription and outputAudioTranscription are enabled so you can log translated text during the session.
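
To make the shape of that object concrete, here is a sketch that assembles a config with the same fields and values as GEMINI_CONNECT_CONFIG in the full scenario. `buildConnectConfig` is a hypothetical helper introduced for illustration; the scenario itself defines the object inline.

```javascript
// Hypothetical helper: build a connectConfig like GEMINI_CONNECT_CONFIG in the full scenario.
// Field names and default values are copied from that scenario.
function buildConnectConfig({systemInstruction, voiceName = "Achird", silenceDurationMs = 200}) {
    if (typeof systemInstruction !== "string" || systemInstruction.trim() === "") {
        throw new Error("systemInstruction must be a non-empty string");
    }
    return {
        responseModalities: ["AUDIO"],              // spoken output
        thinkingConfig: {thinkingLevel: "minimal"}, // keep latency low
        realtimeInputConfig: {
            automaticActivityDetection: {
                disabled: false,
                prefixPaddingMs: 20,
                silenceDurationMs,
            },
        },
        speechConfig: {
            voiceConfig: {prebuiltVoiceConfig: {voiceName}},
        },
        inputAudioTranscription: {},   // enable transcripts of the caller's speech
        outputAudioTranscription: {},  // enable transcripts of the translated speech
        systemInstruction: {parts: [{text: systemInstruction}]},
    };
}
```

The result would be passed as `connectConfig` to `Gemini.createLiveAPIClient(...)`. Exposing `silenceDurationMs` as a parameter is a design choice of this sketch, since it is the value most likely to need tuning per deployment.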

Translation pipeline (one-way)

This example uses a one-way pipeline:

English caller -> Gemini Live API -> Spanish callee

The code wires the audio like this:

Connect audio

call.sendMediaTo(geminiLiveAPIClient);
geminiLiveAPIClient.sendMediaTo(calleeCall);

Barge-in

Gemini sets an interrupted flag in ServerContent when the caller starts speaking while translated audio is still being played. The example clears the media buffer so the queued audio stops immediately:

Barge-in handling

if (payload.interrupted !== undefined) {
  geminiLiveAPIClient.clearMediaBuffer();
}
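
The check above fires whenever the field is present, whatever its value. A slightly stricter variant, sketched below, reacts only when the flag is explicitly `true`. It assumes the flag is a boolean, which this page does not state. `handleBargeIn` is a hypothetical helper; the only client method it relies on is `clearMediaBuffer()`, which the example already uses.

```javascript
// Sketch: clear buffered audio only when the payload explicitly reports an interruption.
// Assumption: payload.interrupted is a boolean when present.
function handleBargeIn(payload, client, log = () => {}) {
    if (payload?.interrupted === true) {
        log("===BARGE-IN===");
        client.clearMediaBuffer();
        return true;
    }
    return false;
}
```

The return value lets the caller skip further processing of a payload that was only an interruption notice.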

Events

The scenario listens for Gemini.LiveAPIEvents.ServerContent. If transcripts are enabled, the example logs both languages:

Transcripts

if (payload.inputTranscription?.text) Logger.write(`===EN=== ${payload.inputTranscription.text}`);
if (payload.outputTranscription?.text) Logger.write(`===ES=== ${payload.outputTranscription.text}`);
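
If transcript text arrives in several fragments per utterance, logging each fragment separately can make logs hard to read. The sketch below buffers the fragments and writes one line per language at the end of a turn. It rests on two assumptions that this page does not confirm: that fragments can simply be concatenated, and that the payload carries a `turnComplete` flag as in Gemini's server content messages. `createTranscriptBuffer` is a hypothetical helper.

```javascript
// Sketch: collect transcript fragments and log one line per language per turn.
// Assumption: payload.turnComplete marks the end of a model turn.
function createTranscriptBuffer(log) {
    const buffers = {EN: "", ES: ""};

    // Write out whatever has been collected and reset the buffers.
    const flush = () => {
        for (const lang of Object.keys(buffers)) {
            const text = buffers[lang].trim();
            if (text) log(`===${lang}=== ${text}`);
            buffers[lang] = "";
        }
    };

    const onServerContent = (payload) => {
        const p = payload || {};
        if (p.inputTranscription?.text) buffers.EN += p.inputTranscription.text;
        if (p.outputTranscription?.text) buffers.ES += p.outputTranscription.text;
        if (p.turnComplete) flush();
    };

    return {onServerContent, flush};
}
```

In a scenario, `log` would typically be `(line) => Logger.write(line)`, and `flush()` could also be called when the call ends so that a partial turn is not lost.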

For illustration, it also logs these events:

Gemini.LiveAPIEvents: SetupComplete, ServerContent, ConnectorInformation, Unknown
Gemini.Events: WebSocketMediaStarted, WebSocketMediaEnded

Notes

This example uses the Gemini Developer API (Gemini.Backend.GEMINI_API).
The current sample uses gemini-3.1-flash-live-preview.
Translation is one-way (English → Spanish). For bidirectional translation, run two Gemini sessions with opposite instructions.
The example includes short prompts (call.say / calleeCall.say) to make recordings easier to follow. Remove them for production.
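
The bidirectional variant mentioned above could be wired roughly as follows. This is a sketch, not part of the sample: `buildInterpreterInstructions` and `wireBidirectional` are hypothetical helpers, and the wiring function assumes only that calls and Gemini clients expose `sendMediaTo(target)`, as in the one-way example.

```javascript
// Hypothetical helper: interpreter instructions for one translation direction.
function buildInterpreterInstructions(from, to) {
    return [
        "You are a REAL-TIME INTERPRETER.",
        `Translate everything you hear from ${from} to ${to}.`,
        `Output ONLY the ${to} translation (no ${from}, no explanations, no extra commentary).`,
        "Do NOT greet or introduce yourself.",
    ].join("\n");
}

// Sketch: connect two one-way pipelines, one Gemini session per direction.
// Each argument only needs a sendMediaTo(target) method.
function wireBidirectional({callerCall, calleeCall, callerToCalleeClient, calleeToCallerClient}) {
    // Caller (language A) -> Gemini session 1 -> callee hears language B
    callerCall.sendMediaTo(callerToCalleeClient);
    callerToCalleeClient.sendMediaTo(calleeCall);
    // Callee (language B) -> Gemini session 2 -> caller hears language A
    calleeCall.sendMediaTo(calleeToCallerClient);
    calleeToCallerClient.sendMediaTo(callerCall);
}
```

With this wiring, each party hears only the translated audio, never the other party's original voice. Each session would be created with its own system instruction, for example `buildInterpreterInstructions("English", "Spanish")` for one and `buildInterpreterInstructions("Spanish", "English")` for the other.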

Gemini 2.5 compatibility

If you are updating an older 2.5 translation sample, replace thinkingBudget with thinkingLevel. For 3.1, this example also sends a short sendRealtimeInput(...) startup instruction on SetupComplete so the live interpretation session begins reliably.
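
As an illustration of the config change, the sketch below rewrites an older connect config so that it uses thinkingLevel instead of thinkingBudget. `migrateThinkingConfig` is a hypothetical helper. It does not try to map a numeric budget to a level; it simply drops the budget and applies the level passed in, defaulting to the "minimal" value used by this example.

```javascript
// Hypothetical helper: replace thinkingBudget with thinkingLevel without mutating the input.
// The default level "minimal" is the value used in the full scenario.
function migrateThinkingConfig(connectConfig, thinkingLevel = "minimal") {
    // Drop thinkingBudget and keep any other thinkingConfig fields.
    const {thinkingBudget, ...rest} = connectConfig.thinkingConfig || {};
    return {
        ...connectConfig,
        thinkingConfig: {...rest, thinkingLevel: rest.thinkingLevel ?? thinkingLevel},
    };
}
```

An existing thinkingLevel in the input is preserved, so running the helper on an already migrated config leaves it unchanged.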

See the VoxEngine API Reference for more details.

Full VoxEngine scenario

voxengine-gemini-s2s-translate.js

/**
 * Voximplant + Gemini Live API connector demo
 * Scenario: real-time speech-to-speech translation (English -> Spanish).
 */

require(Modules.Gemini);
require(Modules.ApplicationStorage);

const SYSTEM_INSTRUCTIONS = `
You are a REAL-TIME INTERPRETER.

Task:
- Translate everything you hear from English to Spanish.

Rules:
- Output ONLY the Spanish translation (no English, no explanations, no extra commentary).
- Preserve meaning, tone, names, numbers, and proper nouns.
- Keep latency low: translate phrase-by-phrase as soon as you have enough context.
- Do NOT greet or introduce yourself. Speak ONLY when the caller speaks.
`;

const GEMINI_MODEL = "gemini-3.1-flash-live-preview";

VoxEngine.addEventListener(AppEvents.CallAlerting, async ({call}) => {
    let voiceAIClient;
    let calleeCall;
    let terminated = false;

    // Hang up both legs and end the session; safe to call more than once.
    const terminate = () => {
        if (terminated) return;
        terminated = true;
        calleeCall?.hangup();
        call?.hangup();
        VoxEngine.terminate();
    };

    call.answer();
    call.record({hd_audio: true, stereo: true});
    call.addEventListener(CallEvents.Disconnected, terminate);
    call.addEventListener(CallEvents.Failed, terminate);

    const geminiApiKey = (await ApplicationStorage.get("GEMINI_API_KEY")).value;

    const calleeDestination = (await ApplicationStorage.get("CALLEE_DESTINATION")).value;
    calleeCall = VoxEngine.callPSTN(calleeDestination, (await ApplicationStorage.get("PSTN_CALLER_ID")).value);
    // Or call via an app user, SIP, or WhatsApp by uncommenting one of the lines below and commenting out the line above.
    // calleeCall = VoxEngine.callUser(calleeDestination);
    // calleeCall = VoxEngine.callSIP(`sip:${calleeDestination}@your-sip-domain`, (await ApplicationStorage.get("PSTN_CALLER_ID")).value);
    // calleeCall = VoxEngine.callWhatsappUser({number: calleeDestination, callerid: (await ApplicationStorage.get("PSTN_CALLER_ID")).value});
    calleeCall.addEventListener(CallEvents.Disconnected, terminate);
    calleeCall.addEventListener(CallEvents.Failed, terminate);

    calleeCall.addEventListener(CallEvents.Connected, async () => {
        calleeCall.record({hd_audio: true, stereo: true});

        // Optional prompts to make the demo obvious on recordings.
        call.say("Connected. Speak in English. The other party will hear Spanish.");
        calleeCall.say("Connected. You will hear Spanish translation in real time.");

        const GEMINI_CONNECT_CONFIG = {
            responseModalities: ["AUDIO"],
            thinkingConfig: {thinkingLevel: "minimal"},
            realtimeInputConfig: {
                automaticActivityDetection: {
                    disabled: false,
                    prefixPaddingMs: 20,
                    silenceDurationMs: 200,
                },
            },
            speechConfig: {
                voiceConfig: {
                    prebuiltVoiceConfig: {voiceName: "Achird"},
                },
            },
            inputAudioTranscription: {},
            outputAudioTranscription: {},
            systemInstruction: {
                parts: [{text: SYSTEM_INSTRUCTIONS}],
            },
        };

        try {
            voiceAIClient = await Gemini.createLiveAPIClient({
                apiKey: geminiApiKey,
                model: GEMINI_MODEL,
                backend: Gemini.Backend.GEMINI_API,
                connectConfig: GEMINI_CONNECT_CONFIG,
                onWebSocketClose: (event) => {
                    Logger.write("===Gemini.WebSocket.Close===");
                    if (event) Logger.write(JSON.stringify(event));
                    terminate();
                },
            });

            // Caller (English) -> Gemini -> Callee (Spanish)
            call.sendMediaTo(voiceAIClient);
            voiceAIClient.sendMediaTo(calleeCall);

            voiceAIClient.addEventListener(Gemini.LiveAPIEvents.SetupComplete, (event) => {
                Logger.write("===Gemini.LiveAPIEvents.SetupComplete===");
                if (event?.data) Logger.write(JSON.stringify(event.data));
                // Startup instruction so the interpretation session begins reliably.
                voiceAIClient.sendRealtimeInput({
                    text: "Start real-time English to Spanish interpretation now. Do not greet. Only translate the caller's speech for the other party.",
                });
            });

            voiceAIClient.addEventListener(Gemini.LiveAPIEvents.ServerContent, (event) => {
                const payload = event?.data?.payload || {};
                if (payload.inputTranscription?.text) {
                    Logger.write(`===EN=== ${payload.inputTranscription.text}`);
                }
                if (payload.outputTranscription?.text) {
                    Logger.write(`===ES=== ${payload.outputTranscription.text}`);
                }
                if (payload.interrupted !== undefined) {
                    Logger.write("===BARGE-IN=== Gemini.LiveAPIEvents.ServerContent");
                    voiceAIClient.clearMediaBuffer();
                }
            });

            // Log every event of interest for illustration.
            [
                Gemini.LiveAPIEvents.SetupComplete,
                Gemini.LiveAPIEvents.ServerContent,
                Gemini.LiveAPIEvents.ConnectorInformation,
                Gemini.LiveAPIEvents.Unknown,
                Gemini.Events.WebSocketMediaStarted,
                Gemini.Events.WebSocketMediaEnded,
            ].forEach((eventName) => {
                voiceAIClient.addEventListener(eventName, (event) => {
                    Logger.write(`===${event.name}===`);
                    if (event?.data) Logger.write(JSON.stringify(event.data));
                });
            });
        } catch (error) {
            Logger.write("===SOMETHING_WENT_WRONG===");
            Logger.write(String(error));
            terminate();
        }
    });
});