Example: Speech-to-speech translation

This example answers an inbound English call, dials a Spanish-speaking callee, and uses the Gemini Live API to translate the caller’s speech into Spanish audio in real time.

⬇️ Jump to the Full VoxEngine scenario.

Gemini 3.1 Flash Live Preview

This page reflects the current gemini-3.1-flash-live-preview flow from Google’s Live API docs: https://ai.google.dev/gemini-api/docs/models/gemini-3.1-flash-live-preview

Prerequisites

The full scenario is triggered by an inbound call (AppEvents.CallAlerting) and reads three values from ApplicationStorage, so store them in your application before running it:

  • GEMINI_API_KEY: the API key used for the Gemini Developer API.
  • CALLEE_DESTINATION: the destination of the Spanish-speaking callee.
  • PSTN_CALLER_ID: the caller ID presented on the outbound call.

Demo video

Video link: Gemini Live speech-to-speech translation demo

Session setup

The Gemini Live API session is configured via connectConfig, passed into Gemini.createLiveAPIClient(...).

In the full scenario, see GEMINI_CONNECT_CONFIG:

  • responseModalities: ["AUDIO"] asks Gemini to speak back in real time.
  • thinkingConfig: { thinkingLevel: "minimal" } reduces latency.
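  • realtimeInputConfig.automaticActivityDetection tunes voice activity detection (the sample uses prefixPaddingMs: 20 and silenceDurationMs: 200), which drives turn detection and barge-in.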
  • speechConfig selects a prebuilt voice for the translated audio.
  • systemInstruction enforces the English → Spanish translation behavior.
  • inputAudioTranscription and outputAudioTranscription are enabled so you can log translated text during the session.
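
The options above can be collected into a small factory if you want to reuse them for other language pairs. The following is a minimal sketch, not part of the sample: buildConnectConfig and its parameters are hypothetical names, and the shortened instruction text is illustrative. The field names and values are copied from GEMINI_CONNECT_CONFIG in the full scenario.

```javascript
// Hypothetical helper (not part of the sample): builds a connectConfig with
// the same fields as GEMINI_CONNECT_CONFIG for a given language pair.
function buildConnectConfig({ sourceLanguage, targetLanguage, voiceName = "Achird" }) {
  const instructions = [
    "You are a REAL-TIME INTERPRETER.",
    `Translate everything you hear from ${sourceLanguage} to ${targetLanguage}.`,
    `Output ONLY the ${targetLanguage} translation.`,
    "Do NOT greet or introduce yourself.",
  ].join("\n");

  return {
    responseModalities: ["AUDIO"],
    thinkingConfig: { thinkingLevel: "minimal" },
    realtimeInputConfig: {
      automaticActivityDetection: {
        disabled: false,
        prefixPaddingMs: 20,
        silenceDurationMs: 200,
      },
    },
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName } },
    },
    inputAudioTranscription: {},
    outputAudioTranscription: {},
    systemInstruction: { parts: [{ text: instructions }] },
  };
}

const config = buildConnectConfig({ sourceLanguage: "English", targetLanguage: "Spanish" });
```

The resulting object would be passed as connectConfig to Gemini.createLiveAPIClient(...), in the same way the full scenario passes GEMINI_CONNECT_CONFIG.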

Translation pipeline (one-way)

This example uses a one-way pipeline:

English caller -> Gemini Live API -> Spanish callee

The code wires the audio like this:

Connect audio
call.sendMediaTo(geminiLiveAPIClient);
geminiLiveAPIClient.sendMediaTo(calleeCall);

Barge-in

Gemini sets an interrupted flag in ServerContent when the caller starts speaking while translated audio is still playing. The example then clears the media buffer so any queued audio stops playing immediately:

Barge-in handling
if (payload.interrupted !== undefined) {
  geminiLiveAPIClient.clearMediaBuffer();
}
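
The check above fires whenever the interrupted field is present, including when it is present but false. If you prefer to react only to an actual interruption, a stricter variant could look like the sketch below. It assumes interrupted is a boolean; handleBargeIn and fakeClient are hypothetical names used so the logic can be tried outside VoxEngine.

```javascript
// Hypothetical handler: clears buffered audio only when Gemini reports an
// interruption. `client` is any object exposing clearMediaBuffer(), as the
// Gemini client does in the sample.
function handleBargeIn(payload, client) {
  if (payload?.interrupted === true) {
    client.clearMediaBuffer();
    return true;
  }
  return false;
}

// Stand-in client that counts how often the buffer was cleared.
const fakeClient = {
  cleared: 0,
  clearMediaBuffer() {
    this.cleared += 1;
  },
};

handleBargeIn({ interrupted: true }, fakeClient);  // clears the buffer
handleBargeIn({ interrupted: false }, fakeClient); // ignored
handleBargeIn({}, fakeClient);                     // ignored
```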

Events

The scenario listens for Gemini.LiveAPIEvents.ServerContent. If transcripts are enabled, the example logs both languages:

Transcripts
if (payload.inputTranscription?.text) Logger.write(`===EN=== ${payload.inputTranscription.text}`);
if (payload.outputTranscription?.text) Logger.write(`===ES=== ${payload.outputTranscription.text}`);
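
Transcription text can arrive in several fragments, so logging each one produces many short lines. A possible refinement is to accumulate fragments and log once per turn. The sketch below is an assumption-heavy illustration: createTranscriptBuffer is a hypothetical helper, and it assumes the payload exposes a turnComplete flag in the same way the Live API's server content does. Check the payloads you actually receive before relying on it.

```javascript
// Hypothetical accumulator: joins transcript fragments and emits one line
// per language when a turn ends.
function createTranscriptBuffer(log) {
  let input = "";
  let output = "";
  return function onServerContent(payload = {}) {
    if (payload.inputTranscription?.text) input += payload.inputTranscription.text;
    if (payload.outputTranscription?.text) output += payload.outputTranscription.text;
    // Assumption: the payload carries turnComplete when the model finishes a turn.
    if (payload.turnComplete) {
      if (input) log(`===EN=== ${input.trim()}`);
      if (output) log(`===ES=== ${output.trim()}`);
      input = "";
      output = "";
    }
  };
}

// Try it with an in-memory log instead of Logger.write.
const lines = [];
const onServerContent = createTranscriptBuffer((line) => lines.push(line));
onServerContent({ inputTranscription: { text: "Good " } });
onServerContent({ inputTranscription: { text: "morning" } });
onServerContent({ outputTranscription: { text: "Buenos días" } });
onServerContent({ turnComplete: true });
```

In a scenario, the log argument would be (line) => Logger.write(line), and onServerContent would be called with the payload from each ServerContent event.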

For illustration, it also logs these events:

  • Gemini.LiveAPIEvents: SetupComplete, ServerContent, ConnectorInformation, Unknown
  • Gemini.Events: WebSocketMediaStarted, WebSocketMediaEnded

Notes

  • This example uses the Gemini Developer API (Gemini.Backend.GEMINI_API).
  • The current sample uses gemini-3.1-flash-live-preview.
  • Translation is one-way (English → Spanish). For bidirectional translation, run two Gemini sessions with opposite instructions.
  • The example includes short prompts (call.say / calleeCall.say) to make recordings easier to follow. Remove them for production.
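
For the bidirectional case mentioned above, the wiring could be sketched as follows. This is an illustration under assumptions, not the sample's code: wireBidirectional, enToEsClient, and esToEnClient are hypothetical names, and the two clients are assumed to be separate Gemini sessions created with opposite system instructions. Only sendMediaTo(...), which the sample already uses on calls and on the Gemini client, is relied on.

```javascript
// Hypothetical wiring for two one-way sessions. Each argument only needs
// sendMediaTo(), which calls and the Gemini client both expose in the sample.
function wireBidirectional({ callerCall, calleeCall, enToEsClient, esToEnClient }) {
  // English caller -> Gemini (EN -> ES) -> Spanish callee
  callerCall.sendMediaTo(enToEsClient);
  enToEsClient.sendMediaTo(calleeCall);
  // Spanish callee -> Gemini (ES -> EN) -> English caller
  calleeCall.sendMediaTo(esToEnClient);
  esToEnClient.sendMediaTo(callerCall);
}

// Stand-in endpoints that record where their audio is routed.
const makeEndpoint = (name) => ({
  name,
  targets: [],
  sendMediaTo(target) {
    this.targets.push(target.name);
  },
});

const endpoints = {
  callerCall: makeEndpoint("caller"),
  calleeCall: makeEndpoint("callee"),
  enToEsClient: makeEndpoint("enToEs"),
  esToEnClient: makeEndpoint("esToEn"),
};
wireBidirectional(endpoints);
```

Neither party hears the other's original audio in this layout; each side hears only the translated stream from the session that feeds it.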

Gemini 2.5 compatibility

If you are updating an older 2.5 translation sample, replace thinkingBudget with thinkingLevel. For 3.1, this example also sends a short sendRealtimeInput(...) startup instruction on SetupComplete so the live interpretation session begins reliably.
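
The thinkingBudget to thinkingLevel change can be applied mechanically to an existing config. The helper below is a minimal sketch: migrateThinkingConfig is a hypothetical name, and defaulting to "minimal" simply mirrors this sample rather than any official budget-to-level mapping.

```javascript
// Hypothetical migration helper: drops thinkingBudget and sets thinkingLevel.
// It returns a new object and leaves the original config untouched.
function migrateThinkingConfig(connectConfig, thinkingLevel = "minimal") {
  const { thinkingBudget, ...rest } = connectConfig.thinkingConfig ?? {};
  return { ...connectConfig, thinkingConfig: { ...rest, thinkingLevel } };
}

const legacyConfig = {
  responseModalities: ["AUDIO"],
  thinkingConfig: { thinkingBudget: 0 },
};
const migrated = migrateThinkingConfig(legacyConfig);
```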

See the VoxEngine API Reference for more details.

Full VoxEngine scenario

voxeengine-gemini-s2s-translate.js
/**
 * Voximplant + Gemini Live API connector demo
 * Scenario: real-time speech-to-speech translation (English -> Spanish).
 */

require(Modules.Gemini);
require(Modules.ApplicationStorage);

const SYSTEM_INSTRUCTIONS = `
You are a REAL-TIME INTERPRETER.

Task:
- Translate everything you hear from English to Spanish.

Rules:
- Output ONLY the Spanish translation (no English, no explanations, no extra commentary).
- Preserve meaning, tone, names, numbers, and proper nouns.
- Keep latency low: translate phrase-by-phrase as soon as you have enough context.
- Do NOT greet or introduce yourself. Speak ONLY when the caller speaks.
`;

const GEMINI_MODEL = "gemini-3.1-flash-live-preview";

VoxEngine.addEventListener(AppEvents.CallAlerting, async ({call}) => {
  let voiceAIClient;
  let calleeCall;
  let terminated = false;

  // Hang up both legs and end the session exactly once.
  const terminate = () => {
    if (terminated) return;
    terminated = true;
    calleeCall?.hangup();
    call?.hangup();
    VoxEngine.terminate();
  };

  call.answer();
  call.record({hd_audio: true, stereo: true});
  call.addEventListener(CallEvents.Disconnected, terminate);
  call.addEventListener(CallEvents.Failed, terminate);

  const geminiApiKey = (await ApplicationStorage.get("GEMINI_API_KEY")).value;

  const calleeDestination = (await ApplicationStorage.get("CALLEE_DESTINATION")).value;
  calleeCall = VoxEngine.callPSTN(calleeDestination, (await ApplicationStorage.get("PSTN_CALLER_ID")).value);
  // Or call via an app user, SIP, or WhatsApp by uncommenting one of the lines below and commenting out the line above.
  // calleeCall = VoxEngine.callUser(calleeDestination);
  // calleeCall = VoxEngine.callSIP(`sip:${calleeDestination}@your-sip-domain`, (await ApplicationStorage.get("PSTN_CALLER_ID")).value);
  // calleeCall = VoxEngine.callWhatsappUser({number: calleeDestination, callerid: (await ApplicationStorage.get("PSTN_CALLER_ID")).value});
  calleeCall.addEventListener(CallEvents.Disconnected, terminate);
  calleeCall.addEventListener(CallEvents.Failed, terminate);

  // Start the Gemini session only after the callee has answered.
  calleeCall.addEventListener(CallEvents.Connected, async () => {
    calleeCall.record({hd_audio: true, stereo: true});

    // Optional prompts to make the demo obvious on recordings.
    call.say("Connected. Speak in English. The other party will hear Spanish.");
    calleeCall.say("Connected. You will hear Spanish translation in real time.");

    const GEMINI_CONNECT_CONFIG = {
      responseModalities: ["AUDIO"],
      thinkingConfig: {thinkingLevel: "minimal"},
      realtimeInputConfig: {
        automaticActivityDetection: {
          disabled: false,
          prefixPaddingMs: 20,
          silenceDurationMs: 200,
        },
      },
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: {voiceName: "Achird"},
        },
      },
      inputAudioTranscription: {},
      outputAudioTranscription: {},
      systemInstruction: {
        parts: [{text: SYSTEM_INSTRUCTIONS}],
      },
    };

    try {
      voiceAIClient = await Gemini.createLiveAPIClient({
        apiKey: geminiApiKey,
        model: GEMINI_MODEL,
        backend: Gemini.Backend.GEMINI_API,
        connectConfig: GEMINI_CONNECT_CONFIG,
        onWebSocketClose: (event) => {
          Logger.write("===Gemini.WebSocket.Close===");
          if (event) Logger.write(JSON.stringify(event));
          terminate();
        },
      });

      // Caller (English) -> Gemini -> Callee (Spanish)
      call.sendMediaTo(voiceAIClient);
      voiceAIClient.sendMediaTo(calleeCall);

      // Send a short startup instruction once the session is ready.
      voiceAIClient.addEventListener(Gemini.LiveAPIEvents.SetupComplete, (event) => {
        Logger.write("===Gemini.LiveAPIEvents.SetupComplete===");
        if (event?.data) Logger.write(JSON.stringify(event.data));
        voiceAIClient.sendRealtimeInput({
          text: "Start real-time English to Spanish interpretation now. Do not greet. Only translate the caller's speech for the other party.",
        });
      });

      // Log transcripts in both languages and handle barge-in.
      voiceAIClient.addEventListener(Gemini.LiveAPIEvents.ServerContent, (event) => {
        const payload = event?.data?.payload || {};
        if (payload.inputTranscription?.text) {
          Logger.write(`===EN=== ${payload.inputTranscription.text}`);
        }
        if (payload.outputTranscription?.text) {
          Logger.write(`===ES=== ${payload.outputTranscription.text}`);
        }
        if (payload.interrupted !== undefined) {
          Logger.write("===BARGE-IN=== Gemini.LiveAPIEvents.ServerContent");
          voiceAIClient.clearMediaBuffer();
        }
      });

      // Log every event for illustration.
      [
        Gemini.LiveAPIEvents.SetupComplete,
        Gemini.LiveAPIEvents.ServerContent,
        Gemini.LiveAPIEvents.ConnectorInformation,
        Gemini.LiveAPIEvents.Unknown,
        Gemini.Events.WebSocketMediaStarted,
        Gemini.Events.WebSocketMediaEnded,
      ].forEach((eventName) => {
        voiceAIClient.addEventListener(eventName, (event) => {
          Logger.write(`===${event.name}===`);
          if (event?.data) Logger.write(JSON.stringify(event.data));
        });
      });
    } catch (error) {
      Logger.write("===SOMETHING_WENT_WRONG===");
      Logger.write(error);
      terminate();
    }
  });
});
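
The scenario reads .value directly from each ApplicationStorage.get(...) result, so a missing key surfaces as an unhelpful error on property access. One way to fail with a clearer message is a small guard such as the sketch below. requireStorageValue is a hypothetical helper, and it assumes a missing key yields a null or empty result.

```javascript
// Hypothetical guard: validates the object returned by ApplicationStorage.get(key)
// and returns its value, or throws an error that names the missing key.
function requireStorageValue(entry, key) {
  if (!entry || typeof entry.value !== "string" || entry.value.trim() === "") {
    throw new Error(`Missing ApplicationStorage key: ${key}`);
  }
  return entry.value;
}

// In a scenario this might be used as:
//   const geminiApiKey = requireStorageValue(
//     await ApplicationStorage.get("GEMINI_API_KEY"), "GEMINI_API_KEY");

// Stand-in entry for trying the guard outside VoxEngine.
const destination = requireStorageValue(
  { key: "CALLEE_DESTINATION", value: "+15550100" },
  "CALLEE_DESTINATION"
);
```

If you adopt something like this, wrap the reads in a try/catch that logs the error and calls terminate(), so a misconfigured application ends the call cleanly.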