How VoxisLive Works — Real-Time Voice Translation on Windows

VoxisLive is a Windows app that captures system audio directly, with no virtual cable and no driver install. It detects speech on-device, sends it to Gemini Live for real-time speech-to-speech translation, and speaks the result back through your speakers in a natural voice while automatically ducking the original audio. Here's how it works:

How does VoxisLive capture audio without a virtual audio cable?

VoxisLive reads your system audio using WASAPI loopback, the loopback mode of the Windows Audio Session API that screen recorders use to capture "what's playing." This means there is no virtual audio cable to install, no VB-CABLE driver, and no changes to your audio routing. Whatever is playing on Windows (a YouTube video, a game, a Zoom call, a Twitch stream), VoxisLive reads a copy of that audio at the OS level, on your own machine. The capture itself is read-only: it adds no perceptible delay to playback and does not alter what you or other apps hear.

WASAPI loopback is a native Windows capability that has existed since Vista. Because VoxisLive calls it through the standard Windows API, it works across every modern Windows 10 and 11 configuration without compatibility shims or third-party drivers. There is nothing to uninstall when you are done, and your audio device configuration remains exactly as it was.

Why this matters for you: Competing approaches typically require routing audio through a virtual audio device (VB-CABLE, Virtual Audio Cable, JACK for Windows). Those introduce an extra audio hop, require driver installation (which needs admin rights and sometimes a reboot), and can cause conflicts with other software. VoxisLive sidesteps all of that.

See VoxisLive for Windows — Download to get started.

How does VoxisLive detect speech?

Before any audio leaves your device, VoxisLive runs on-device voice activity detection (VAD) to separate speech segments from silence, background noise, and music. This detection runs locally — on your CPU — with no network round-trip. Only segments that are identified as containing human speech are sent forward to the translation engine.

On-device VAD serves two purposes. First, it reduces latency: the translation request is fired at the moment speech is confidently detected, not after a fixed timer expires. Second, it reduces cost: silence, hold music, and ambient sound never consume translation capacity, so managed-plan minutes are spent only on actual speech. If you are watching a movie and someone delivers a ten-second line, VoxisLive buffers that segment and dispatches it as one coherent utterance rather than chopping speech mid-word.
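As a rough illustration of how a detector can hold a whole utterance together, here is a minimal energy-based sketch. It is not VoxisLive's actual VAD, whose internals are not described here; the function names, the energy threshold, and the "hangover" count (how many quiet frames are tolerated before an utterance is considered finished) are all assumptions made for the example.

```python
from typing import List, Tuple


def rms(frame: List[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return (sum(s * s for s in frame) / len(frame)) ** 0.5


def segment_speech(
    frames: List[List[float]],
    threshold: float = 0.02,   # hypothetical energy threshold
    hangover_frames: int = 3,  # hypothetical: quiet frames tolerated inside an utterance
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, one per utterance."""
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current utterance
    quiet = 0     # consecutive quiet frames seen since the last loud frame
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover_frames:
                # The pause is long enough: close the utterance at the last loud frame.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Audio ended mid-utterance: flush it, trimming trailing quiet frames.
        segments.append((start, len(frames) - quiet))
    return segments


loud, silent = [0.1] * 160, [0.0] * 160
demo = [silent, silent, loud, loud, silent, loud, loud,
        silent, silent, silent, silent, silent, loud]
utterances = segment_speech(demo)
```

In this sketch the single quiet frame in the middle of the first utterance is shorter than the hangover, so the speech around it stays in one segment instead of being cut in two.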

The VAD stage also handles the self-exclusion problem. VoxisLive tracks its own spoken output and suppresses those frames so the translation loop never hears its own voice and re-translates it. This is a prerequisite for reliable two-way use, and it is handled entirely locally.
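One simple way to picture self-exclusion is as interval bookkeeping: remember when the app's own voice was playing and drop captured frames that overlap those times. The sketch below assumes exactly that reading. The names, the millisecond timestamps, and the guard margin are hypothetical and are not taken from VoxisLive itself.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_ms, end_ms), end exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def keep_external_frames(
    frames: List[Interval],
    own_playback: List[Interval],
    guard_ms: int = 20,  # hypothetical safety margin around each playback interval
) -> List[int]:
    """Return indices of captured frames that do not overlap the app's own output."""
    padded = [(start - guard_ms, end + guard_ms) for start, end in own_playback]
    return [
        i for i, frame in enumerate(frames)
        if not any(overlaps(frame, p) for p in padded)
    ]


captured = [(0, 100), (100, 200), (200, 300), (300, 400), (400, 500)]
spoken_by_app = [(200, 300)]
kept = keep_external_frames(captured, spoken_by_app)
```

The guard margin widens each playback interval slightly, which is why the frames immediately before and after the app's own speech are also dropped in this example.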

How does VoxisLive translate in real time?

After speech detection, the audio segment is passed to Gemini Live — Google's multimodal real-time model — for speech-to-speech translation. Gemini Live accepts audio input directly and returns translated audio output, meaning it handles both the transcription and the translation in a single low-latency pass rather than chaining a separate speech-to-text service with a translation service with a text-to-speech service.

This architecture is what makes simultaneous-style translation possible. Traditional translation pipelines have three sequential network calls (ASR → MT → TTS), each adding hundreds of milliseconds of latency. Gemini Live collapses those into one streaming call. VoxisLive manages the session lifecycle, streaming the audio in and queuing the translated audio out.
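The "stream in, queue out" shape described above can be sketched with two queues and a worker. This is only an outline under stated assumptions: `fake_translate` is a stand-in defined here for the example and is not the Gemini Live API, and `run_session` and the sentinel convention are hypothetical names chosen for illustration.

```python
import asyncio
from typing import List, Optional


async def fake_translate(segment: bytes) -> bytes:
    """Stand-in for the streaming translation call; NOT the Gemini Live API."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return b"translated:" + segment


async def run_session(segments: List[bytes]) -> List[bytes]:
    """Feed speech segments in, collect translated clips out, preserving order."""
    inbound: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    outbound: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()

    async def worker() -> None:
        while True:
            seg = await inbound.get()
            if seg is None:  # sentinel: the session is closing
                await outbound.put(None)
                return
            await outbound.put(await fake_translate(seg))

    task = asyncio.create_task(worker())
    for seg in segments:
        await inbound.put(seg)
    await inbound.put(None)  # signal end of input

    played: List[bytes] = []
    while (clip := await outbound.get()) is not None:
        played.append(clip)  # a real app would hand the clip to the audio output
    await task
    return played


clips = asyncio.run(run_session([b"hola", b"mundo"]))
```

Keeping a single worker between the two queues means clips come out in the order the speech went in, which matters more for playback than raw throughput.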

VoxisLive is open-core: you can bring your own Gemini API key (BYOK) at no subscription cost — translation runs against your own Google AI quota. If you prefer not to manage API keys, Creator ($19/month) and Pro ($39/month) plans include managed cloud minutes with no setup required. Compare plans on the pricing page.

How does VoxisLive speak the translation back?

The translated audio from Gemini Live is played back through your default output device in a natural synthesized voice. Two signal-processing steps happen simultaneously:

Psychoacoustic ducking. At the moment translation audio begins playing, the original source audio is briefly reduced in volume (ducked). This mirrors how professional simultaneous interpreters work — the interpreter's voice rides over the original rather than competing with it at equal level. The result is that you hear the translation clearly without losing the acoustic context of the original (tone, emotion, speaker identity).

Latency synchronization. Voxis aligns the translation playback with the speech segment it corresponds to, compensating for the variable processing time of the Gemini Live call. This prevents the translated voice from drifting out of sync with on-screen action over long sessions.

The output voice quality is governed by Gemini Live's synthesis, which produces human-like prosody. Voxis does not apply additional compression or equalization that would degrade voice clarity.
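The ducking step described above can be illustrated as a gain envelope applied to the original audio. The sketch below is an assumption about one plausible shape for that envelope, not VoxisLive's implementation: the duck level, the ramp step, and the function names are hypothetical, and each list entry stands for one audio frame.

```python
from typing import List


def duck_gain(
    translation_active: List[bool],
    duck_level: float = 0.25,  # hypothetical gain for the original while translation plays
    step: float = 0.25,        # hypothetical per-frame ramp, avoids abrupt volume jumps
) -> List[float]:
    """Per-frame gain for the original audio, ramping between 1.0 and duck_level."""
    gains: List[float] = []
    gain = 1.0
    for active in translation_active:
        target = duck_level if active else 1.0
        if gain > target:
            gain = max(target, gain - step)
        elif gain < target:
            gain = min(target, gain + step)
        gains.append(gain)
    return gains


def mix(original: List[float], translation: List[float], gains: List[float]) -> List[float]:
    """Mix the ducked original with the translation, one value per frame."""
    return [o * g + t for o, t, g in zip(original, translation, gains)]


active = [False, True, True, True, True, False, False, False, False]
envelope = duck_gain(active)
mixed = mix([1.0, 1.0], [0.0, 0.5], [1.0, 0.25])
```

Ramping the gain over a few frames, rather than switching it instantly, is the usual way to keep the original audible in the background without clicks at the start and end of each translated phrase.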

Is VoxisLive simultaneous interpretation — is it really real-time?

VoxisLive is near-simultaneous, not zero-delay. There is an inherent minimum latency between a speaker finishing a sentence and Voxis speaking the translation — this is the time required for VAD to confirm the utterance has ended, plus the Gemini Live round-trip. In practice, under normal network conditions, this is roughly one to two seconds behind the original speech.
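The delay described above is a simple sum, which a few lines make concrete. The figures below are hypothetical and chosen only to illustrate the arithmetic; they are not measurements of VoxisLive, and the function name is an assumption.

```python
def end_of_utterance_delay_ms(vad_hangover_ms: int, round_trip_ms: int) -> int:
    """Delay between the speaker finishing and the translation starting to play."""
    return vad_hangover_ms + round_trip_ms


# Hypothetical figures: 400 ms for VAD to confirm the utterance has ended,
# 900 ms for the translation round-trip.
delay = end_of_utterance_delay_ms(vad_hangover_ms=400, round_trip_ms=900)
within_stated_range = 1000 <= delay <= 2000
```

With these example values the total is 1.3 seconds, inside the one-to-two-second range quoted above. A slower network raises the second term directly.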

For comparison, professional human simultaneous interpreters in a UN booth typically work two to four seconds behind the speaker. VoxisLive operates in that same range or faster, depending on utterance length and network latency. It is not suitable for applications that require zero latency (real-time captioning SLAs, for example), but it is well within the threshold that makes media, meetings, and gaming comfortable.

Translation quality scales with utterance length. Voxis collects a complete utterance before translating, which gives Gemini Live enough context for accurate output. Very short fragments ("Yeah", "Okay", "Thanks") are batched or deferred to avoid poor-quality one-word translations.
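Batching short fragments can be pictured as accumulating speech segments until they add up to a minimum duration. This is a sketch under assumptions: the 600 ms minimum, the function name, and the choice to flush leftovers at the end are illustrative and do not describe VoxisLive's actual policy.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms) of one detected speech segment


def batch_short_segments(
    segments: List[Segment],
    min_speech_ms: int = 600,  # hypothetical minimum amount of speech per request
) -> List[List[Segment]]:
    """Group segments so each batch holds at least min_speech_ms of speech."""
    batches: List[List[Segment]] = []
    pending: List[Segment] = []
    for seg in segments:
        pending.append(seg)
        if sum(end - start for start, end in pending) >= min_speech_ms:
            batches.append(pending)
            pending = []
    if pending:
        batches.append(pending)  # flush leftovers so nothing is silently dropped
    return batches


detected = [(0, 300), (500, 2500), (3000, 3200), (3400, 3700)]
grouped = batch_short_segments(detected)
```

Here the 300 ms fragment is deferred and sent together with the longer sentence that follows it, while the two short fragments at the end are flushed together when the input ends.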

Explore how Voxis handles live meetings and game audio translation.

What about meetings — does VoxisLive use a bot to join calls?

No. VoxisLive never joins a call as a bot participant, requests meeting host permissions, or interacts with the meeting application in any way. It reads the audio that the meeting app (Zoom, Teams, Google Meet, Discord) is already playing to your speakers via WASAPI loopback, exactly as any recording software would. From the meeting platform's perspective, VoxisLive does not exist.

This has three practical consequences. First, other participants never see a bot entry in the participant list. Second, VoxisLive works with every meeting platform without needing a platform-specific integration — if it plays audio on Windows, Voxis can translate it. Third, there is no dependency on platform APIs that can be revoked or rate-limited.

Two-way meeting mode works the same way: Voxis captures both directions from the system audio mix. It distinguishes its own output frames (using the self-exclusion mechanism described above) so that speaker B does not hear a double-translation of the reply Voxis just synthesized for speaker A.

See VoxisLive's meeting use case page for a step-by-step walkthrough.

Privacy and BYOK — where does my audio go?

With BYOK (your Gemini API key): Audio goes from your device directly to the Google AI API endpoint associated with your own Google account. VoxisLive's servers are not in the path. Google's data handling for the Gemini API is governed by Google's own terms and AI principles. VoxisLive never stores, logs, or processes that audio.

With managed plans (Creator / Pro): Audio segments travel to VoxisLive's cloud infrastructure, which proxies the call to Gemini Live. VoxisLive processes audio in transit and does not retain audio content after the translation session ends. See the Privacy Policy for the full data retention schedule.

In both modes, the on-device VAD stage means that silence and non-speech audio never leave your machine at all. Only confirmed speech segments are transmitted.

VoxisLive is a single-user, local application. It does not record ambient audio in the background, does not run as a system service unless you configure it to, and does not have microphone access (it captures system output audio, not microphone input).

Common questions

Does VoxisLive work with headphones and Bluetooth audio?

Yes. WASAPI loopback captures the final mix that Windows renders for playback, regardless of the type of output device. Switching between headphones, speakers, or Bluetooth does not affect capture. The translation output follows your default playback device.

Will it work if the original audio is not in English?

Yes. VoxisLive supports source audio in multiple languages, and Gemini Live detects the source language internally. You configure only your target (output) language, and VoxisLive handles the rest regardless of the source language.

Do I need to leave a terminal or command prompt open?

No. VoxisLive runs as a standard Windows application with a graphical interface. No command line is required for normal use.

Is there a free version?

Yes. The Developer tier is free and uses your own Gemini API key (BYOK). Google provides free API quota for Gemini, so translation can be genuinely cost-free depending on your usage. Download VoxisLive or see all plans.

---

*Ready to try it?* Download VoxisLive for Windows — free to start, no virtual cable required.
