Example: Speech-to-speech translation

This example answers an inbound English call, dials a Spanish-speaking callee, and uses the Gemini Live API to translate the caller’s speech into Spanish audio in real time.

⬇️ Jump to the Full VoxEngine scenario.

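Prerequisites

The full scenario reads these values from ApplicationStorage, so store them in the application before running it:

  • GEMINI_API_KEY: the Gemini Developer API key.
  • CALLEE_DESTINATION: the destination to dial for the Spanish-speaking callee.
  • PSTN_CALLER_ID: the caller ID used for the outbound call.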
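
The scenario dereferences .value on each ApplicationStorage.get(...) result directly, so a missing key would likely surface as a generic error in the middle of a call. A minimal sketch of checking all keys up front follows. The helper name loadRequiredSettings is hypothetical, and the sketch assumes get(key) resolves either to an object with a string value field or to a null-ish result when the key is absent.

```javascript
// Hypothetical helper (not part of VoxEngine): reads every required key up
// front and fails with one clear message if any is missing or empty.
// `storage` is any object with an async get(key); this sketch assumes it
// resolves to { value } or to a null-ish result for an absent key.
async function loadRequiredSettings(storage, keys) {
  const settings = {};
  const missing = [];
  for (const key of keys) {
    const entry = await storage.get(key);
    if (entry && typeof entry.value === "string" && entry.value !== "") {
      settings[key] = entry.value;
    } else {
      missing.push(key);
    }
  }
  if (missing.length > 0) {
    throw new Error(`Missing ApplicationStorage keys: ${missing.join(", ")}`);
  }
  return settings;
}
```

Inside the scenario this could be called as loadRequiredSettings(ApplicationStorage, ["GEMINI_API_KEY", "CALLEE_DESTINATION", "PSTN_CALLER_ID"]) before dialing the callee.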

Session setup

The Gemini Live API session is configured via connectConfig, passed into Gemini.createLiveAPIClient(...).

In the full scenario, see GEMINI_CONNECT_CONFIG:

  • responseModalities: ["AUDIO"] asks Gemini to speak back in real time.
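  • thinkingConfig: { thinkingBudget: 0 } disables the model's thinking step to reduce latency.
  • realtimeInputConfig.automaticActivityDetection tunes voice activity detection (prefixPaddingMs and silenceDurationMs in the scenario), which controls how quickly the start and end of speech are detected and therefore how barge-in behaves.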
  • speechConfig selects a prebuilt voice for the translated audio.
  • systemInstruction enforces the English → Spanish translation behavior.

To log text transcripts, uncomment inputAudioTranscription and outputAudioTranscription.
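
Instead of commenting lines in and out, the transcription fields can be added conditionally. The sketch below is one way to do that; the helper name buildConnectConfig and its transcripts option are hypothetical, while the field values are copied from GEMINI_CONNECT_CONFIG in the full scenario.

```javascript
// Hypothetical helper: returns a connectConfig like GEMINI_CONNECT_CONFIG,
// adding the two transcription fields only when transcripts are requested.
function buildConnectConfig(systemInstructions, { transcripts = false } = {}) {
  const config = {
    responseModalities: ["AUDIO"],
    thinkingConfig: { thinkingBudget: 0 },
    realtimeInputConfig: {
      automaticActivityDetection: {
        disabled: false,
        prefixPaddingMs: 20,
        silenceDurationMs: 200,
      },
    },
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: "Achird" } },
    },
    systemInstruction: { parts: [{ text: systemInstructions }] },
  };
  if (transcripts) {
    // Same empty objects as the commented-out lines in the scenario.
    config.inputAudioTranscription = {};
    config.outputAudioTranscription = {};
  }
  return config;
}
```

The result would be passed as connectConfig to Gemini.createLiveAPIClient(...), for example buildConnectConfig(SYSTEM_INSTRUCTIONS, { transcripts: true }) while debugging.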

Translation pipeline (one-way)

This example uses a one-way pipeline:

English caller -> Gemini Live API -> Spanish callee

The code wires the audio like this:

Connect audio
call.sendMediaTo(geminiLiveAPIClient);
geminiLiveAPIClient.sendMediaTo(calleeCall);

Barge-in

Gemini includes an interrupted flag in ServerContent when the caller speaks over Gemini's audio output. The example clears the media buffer so the queued audio stops playing immediately:

Barge-in handling
if (payload.interrupted !== undefined) {
  geminiLiveAPIClient.clearMediaBuffer();
}
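
The snippet above assumes payload has already been extracted from the event. A minimal sketch of a self-contained listener is shown below; makeBargeInHandler is a hypothetical name, and the payload path (event.data.payload) and the clearMediaBuffer() call are taken from the full scenario.

```javascript
// Hypothetical factory: returns a ServerContent listener that clears the
// client's media buffer when the payload carries an `interrupted` field.
// `client` only needs a clearMediaBuffer() method; `log` is any string sink.
function makeBargeInHandler(client, log = () => {}) {
  return (event) => {
    // Missing event, data, or payload is treated as "no barge-in".
    const payload = event?.data?.payload || {};
    if (payload.interrupted === undefined) return false;
    log("===BARGE-IN===");
    client.clearMediaBuffer();
    return true;
  };
}
```

In the scenario this could be registered with voiceAIClient.addEventListener(Gemini.LiveAPIEvents.ServerContent, makeBargeInHandler(voiceAIClient, (message) => Logger.write(message))).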

Events

The scenario listens for Gemini.LiveAPIEvents.ServerContent. If transcripts are enabled, the example logs both languages:

Transcripts
if (payload.inputTranscription?.text) Logger.write(`===EN=== ${payload.inputTranscription.text}`);
if (payload.outputTranscription?.text) Logger.write(`===ES=== ${payload.outputTranscription.text}`);
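
If transcript text arrives in several fragments per turn, logging each fragment separately can make the session log hard to read. The sketch below buffers fragments and writes one line per language when a turn ends. It rests on two assumptions that the scenario does not confirm: that the payload exposes Gemini's turnComplete flag, and that fragments carry their own spacing so they can be concatenated as-is. The name createTranscriptBuffer is hypothetical.

```javascript
// Hypothetical helper: collects transcript fragments per language and writes
// one joined line per language when a turn ends.
function createTranscriptBuffer(write) {
  const parts = { EN: [], ES: [] };

  // Writes and resets every non-empty buffer (EN first, then ES).
  const flush = () => {
    for (const lang of Object.keys(parts)) {
      if (parts[lang].length === 0) continue;
      write(`===${lang}=== ${parts[lang].join("").trim()}`);
      parts[lang] = [];
    }
  };

  return {
    // Feed every ServerContent payload through push().
    push(payload = {}) {
      if (payload.inputTranscription?.text) parts.EN.push(payload.inputTranscription.text);
      if (payload.outputTranscription?.text) parts.ES.push(payload.outputTranscription.text);
      // Assumption: the payload exposes Gemini's turnComplete flag.
      if (payload.turnComplete) flush();
    },
    flush,
  };
}
```

Calling flush() from the termination path would write out whatever is still buffered if the call ends in the middle of a turn.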

For illustration, it also logs these events:

  • Gemini.LiveAPIEvents: SetupComplete, ServerContent, ConnectorInformation, Unknown
  • Gemini.Events: WebSocketMediaStarted, WebSocketMediaEnded

Notes

  • This example uses the Gemini Developer API (Gemini.Backend.GEMINI_API).
  • Translation is one-way (English → Spanish). For bidirectional translation, run two Gemini sessions with opposite instructions.
  • The example includes short prompts (call.say / calleeCall.say) to make recordings easier to follow. Remove them for production.
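
The bidirectional setup mentioned above could be wired roughly as follows. This is a sketch under stated assumptions, not a tested scenario: buildInterpreterInstructions and wireTwoWay are hypothetical helpers, createClient is an injected async factory that would wrap Gemini.createLiveAPIClient(...) with the given system instructions, and the sketch assumes a call leg can send media to one session while receiving from another.

```javascript
// Hypothetical helper: interpreter instructions for one direction, modeled
// on SYSTEM_INSTRUCTIONS in the full scenario.
function buildInterpreterInstructions(sourceLang, targetLang) {
  return [
    "You are a REAL-TIME INTERPRETER.",
    `Translate everything you hear from ${sourceLang} to ${targetLang}.`,
    `Output ONLY the ${targetLang} translation.`,
    "Do NOT greet or introduce yourself.",
  ].join("\n");
}

// Hypothetical helper: one session per direction, so each session hears
// only one leg and speaks only to the other.
async function wireTwoWay(callerLeg, calleeLeg, createClient) {
  const enToEs = await createClient(buildInterpreterInstructions("English", "Spanish"));
  const esToEn = await createClient(buildInterpreterInstructions("Spanish", "English"));

  // Caller (English) -> session 1 -> callee (Spanish)
  callerLeg.sendMediaTo(enToEs);
  enToEs.sendMediaTo(calleeLeg);

  // Callee (Spanish) -> session 2 -> caller (English)
  calleeLeg.sendMediaTo(esToEn);
  esToEn.sendMediaTo(callerLeg);

  return { enToEs, esToEn };
}
```

Each session would also need its own ServerContent listener, since barge-in on one direction should clear only that session's media buffer.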

See the VoxEngine API Reference for more details.

Full VoxEngine scenario

voxeengine-gemini-s2s-translate.js
/**
 * Voximplant + Gemini Live API connector demo
 * Scenario: real-time speech-to-speech translation (English -> Spanish).
 */

require(Modules.Gemini);
require(Modules.ApplicationStorage);

const SYSTEM_INSTRUCTIONS = `
You are a REAL-TIME INTERPRETER.

Task:
- Translate everything you hear from English to Spanish.

Rules:
- Output ONLY the Spanish translation (no English, no explanations, no extra commentary).
- Preserve meaning, tone, names, numbers, and proper nouns.
- Keep latency low: translate phrase-by-phrase as soon as you have enough context.
- Do NOT greet or introduce yourself. Speak ONLY when the caller speaks.
`;

const GEMINI_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025";

VoxEngine.addEventListener(AppEvents.CallAlerting, async ({call}) => {
  let voiceAIClient;
  let calleeCall;
  let terminated = false;

  // Hang up both legs and end the session; safe to call more than once.
  const terminate = () => {
    if (terminated) return;
    terminated = true;
    calleeCall?.hangup();
    call?.hangup();
    VoxEngine.terminate();
  };

  call.answer();
  call.record({hd_audio: true, stereo: true});
  call.addEventListener(CallEvents.Disconnected, terminate);
  call.addEventListener(CallEvents.Failed, terminate);

  const geminiApiKey = (await ApplicationStorage.get("GEMINI_API_KEY")).value;

  const calleeDestination = (await ApplicationStorage.get("CALLEE_DESTINATION")).value;
  calleeCall = VoxEngine.callPSTN(calleeDestination, (await ApplicationStorage.get("PSTN_CALLER_ID")).value);
  // Or call via an app user, SIP, or WhatsApp by uncommenting one of the lines below and commenting out the line above.
  // calleeCall = VoxEngine.callUser(calleeDestination);
  // calleeCall = VoxEngine.callSIP(`sip:${calleeDestination}@your-sip-domain`, (await ApplicationStorage.get("PSTN_CALLER_ID")).value);
  // calleeCall = VoxEngine.callWhatsappUser({number: calleeDestination, callerid: (await ApplicationStorage.get("PSTN_CALLER_ID")).value});
  calleeCall.addEventListener(CallEvents.Disconnected, terminate);
  calleeCall.addEventListener(CallEvents.Failed, terminate);

  calleeCall.addEventListener(CallEvents.Connected, async () => {
    calleeCall.record({hd_audio: true, stereo: true});

    // Optional prompts to make the demo obvious on recordings.
    call.say("Connected. Speak in English. The other party will hear Spanish.");
    calleeCall.say("Connected. You will hear Spanish translation in real time.");

    const GEMINI_CONNECT_CONFIG = {
      responseModalities: ["AUDIO"],
      thinkingConfig: {thinkingBudget: 0},
      realtimeInputConfig: {
        automaticActivityDetection: {
          disabled: false,
          prefixPaddingMs: 20,
          silenceDurationMs: 200,
        },
      },
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: {voiceName: "Achird"},
        },
      },
      // inputAudioTranscription: {},
      // outputAudioTranscription: {},
      systemInstruction: {
        parts: [{text: SYSTEM_INSTRUCTIONS}],
      },
    };

    try {
      voiceAIClient = await Gemini.createLiveAPIClient({
        apiKey: geminiApiKey,
        model: GEMINI_MODEL,
        backend: Gemini.Backend.GEMINI_API,
        connectConfig: GEMINI_CONNECT_CONFIG,
        onWebSocketClose: (event) => {
          Logger.write("===Gemini.WebSocket.Close===");
          if (event) Logger.write(JSON.stringify(event));
          terminate();
        },
      });

      // Caller (English) -> Gemini -> Callee (Spanish)
      call.sendMediaTo(voiceAIClient);
      voiceAIClient.sendMediaTo(calleeCall);

      voiceAIClient.addEventListener(Gemini.LiveAPIEvents.SetupComplete, (event) => {
        Logger.write("===Gemini.LiveAPIEvents.SetupComplete===");
        if (event?.data) Logger.write(JSON.stringify(event.data));
      });

      voiceAIClient.addEventListener(Gemini.LiveAPIEvents.ServerContent, (event) => {
        const payload = event?.data?.payload || {};
        if (payload.inputTranscription?.text) {
          Logger.write(`===EN=== ${payload.inputTranscription.text}`);
        }
        if (payload.outputTranscription?.text) {
          Logger.write(`===ES=== ${payload.outputTranscription.text}`);
        }
        if (payload.interrupted !== undefined) {
          Logger.write("===BARGE-IN=== Gemini.LiveAPIEvents.ServerContent");
          voiceAIClient.clearMediaBuffer();
        }
      });

      // Log every event of interest, for illustration.
      [
        Gemini.LiveAPIEvents.SetupComplete,
        Gemini.LiveAPIEvents.ServerContent,
        Gemini.LiveAPIEvents.ConnectorInformation,
        Gemini.LiveAPIEvents.Unknown,
        Gemini.Events.WebSocketMediaStarted,
        Gemini.Events.WebSocketMediaEnded,
      ].forEach((eventName) => {
        voiceAIClient.addEventListener(eventName, (event) => {
          Logger.write(`===${event.name}===`);
          if (event?.data) Logger.write(JSON.stringify(event.data));
        });
      });
    } catch (error) {
      Logger.write("===SOMETHING_WENT_WRONG===");
      Logger.write(error);
      terminate();
    }
  });
});