HOW IT WORKS

From your speakers to your language in about two seconds.

VoxisLive captures system audio directly through WASAPI loopback, detects speech on-device, translates it with a real-time AI model, and speaks the result back — while automatically lowering the original audio.

Step 1 — Audio capture, without drivers

VoxisLive uses WASAPI loopback, the loopback mode of the low-level Windows Audio Session API that screen recorders use to capture “what's playing.” It is a native Windows capability: no virtual audio cables, no driver installation, no audio routing changes. Capture adds no delay to playback and no audible artifacts.

Competing tools typically route audio through virtual devices like VB-CABLE, which need driver installs (often admin rights and a reboot) and can conflict with exclusive-mode audio, ASIO drivers or anti-cheat systems. VoxisLive skips that class of problem entirely.

Step 2 — On-device speech detection

The app runs on-device voice-activity detection (VAD) to separate speech from silence, background noise and music — locally, on your CPU, with no network round-trip. Only segments identified as human speech proceed to translation, which cuts latency and protects your minute balance. VAD also tracks VoxisLive's own spoken output so it never re-translates its own voice.
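To make the gating idea concrete, here is a minimal sketch in Python. It is not VoxisLive's detector: it swaps in a plain energy threshold, which can only tell loud from quiet and cannot separate speech from music the way a trained model can. The frame length, `threshold`, `hangover_frames` and the `own_output_active` flags are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

FRAME_MS = 20  # assumed frame length, for illustration only


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(
    frames: Sequence[Sequence[float]],
    own_output_active: Sequence[bool],
    threshold: float = 0.02,
    hangover_frames: int = 5,
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs, end exclusive, judged to be speech.

    own_output_active[i] is True while the app's own translated voice is
    playing; those frames are never treated as speech, so the app's output
    is not fed back into translation.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the open segment, if any
    quiet = 0     # consecutive non-speech frames seen inside the open segment
    for i, frame in enumerate(frames):
        is_speech = rms(frame) >= threshold and not own_output_active[i]
        if is_speech:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover_frames:
                # Close the segment at the last frame that was speech.
                segments.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        segments.append((start, len(frames) - quiet))
    return segments
```

Only the frames inside the returned segments would go on to translation; everything else stays local. The hangover count keeps short pauses inside a sentence from splitting it into several segments.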

Step 3 — One-pass simultaneous translation

Speech segments go to a multimodal real-time model that handles recognition, translation and voice synthesis in a single low-latency pass — collapsing the three sequential network calls of a traditional pipeline (speech-to-text → translation → text-to-speech) into one. Like a human interpreter in a booth, it begins translating while the speaker is still talking.
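As a rough back-of-the-envelope illustration of why one pass helps, the sketch below counts only network round trips. The 80 ms round-trip time is a hypothetical placeholder, not a VoxisLive measurement, and the calculation deliberately ignores model processing time and streaming overlap.

```python
def round_trip_overhead_ms(network_calls: int, round_trip_ms: float) -> float:
    """Time spent purely on network round trips for sequential calls."""
    return network_calls * round_trip_ms


RTT_MS = 80.0  # hypothetical round-trip time, chosen only for illustration

# Traditional pipeline: speech-to-text, then translation, then text-to-speech.
cascade = round_trip_overhead_ms(3, RTT_MS)
# Single multimodal call that covers all three steps.
single_pass = round_trip_overhead_ms(1, RTT_MS)

saved = cascade - single_pass
print(f"round-trip time saved per utterance: {saved:.0f} ms")
```

Whatever the real round-trip time is, the saving from this effect alone is two round trips per utterance.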

Step 4 — Spoken playback with ducking

The translated voice plays through your output device while two things happen in parallel: psychoacoustic ducking lowers the original audio while the translation speaks (mirroring professional simultaneous interpretation), and latency synchronization keeps each translation aligned with its speech segment so long sessions never drift.
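The ducking half of that step can be sketched as a smoothed gain applied to the original audio. This is a simplified illustration in Python, not the app's actual mixer: it uses a plain gain envelope with no psychoacoustic model, and `duck_gain`, `attack` and `release` are assumed values.

```python
from typing import List, Sequence


def duck_gains(
    translation_active: Sequence[bool],
    duck_gain: float = 0.25,
    attack: float = 0.5,
    release: float = 0.1,
) -> List[float]:
    """Per-frame gain for the original audio.

    The gain moves toward duck_gain while the translation is speaking and
    back toward 1.0 otherwise. attack and release are the fraction of the
    remaining distance covered on each frame, so changes are gradual
    rather than abrupt.
    """
    gain = 1.0
    gains: List[float] = []
    for active in translation_active:
        target = duck_gain if active else 1.0
        rate = attack if active else release
        gain += (target - gain) * rate
        gains.append(gain)
    return gains


def mix_frame(
    original: Sequence[float], translation: Sequence[float], gain: float
) -> List[float]:
    """Mix one frame: attenuated original plus translated voice, clipped to [-1, 1]."""
    return [max(-1.0, min(1.0, o * gain + t)) for o, t in zip(original, translation)]
```

A fast attack and slower release is a common choice for ducking: the original drops quickly when the translation starts and returns gently when it stops.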

TECHNICAL REFERENCES & CITATIONS

Architecture & Standards Documentation

VoxisLive's engine is built on open standards and enterprise AI specifications. Read the underlying documentation:

WASAPI Loopback Capture: Built using Microsoft Core Audio APIs (WASAPI) for zero-driver system audio capture.
Google Gemini Live API: Real-time WebSocket translation powered by Google AI Studio Gemini Live API.
Alibaba Qwen Realtime API: Real-time streaming translation powered by Alibaba Cloud DashScope Qwen Realtime API.
On-Device VAD & Diarization: Local voice-activity detection and speaker diarization powered by sherpa-onnx (C++), running its models on the CPU.

[Screenshot: VoxisLive main window translating a video, showing the live bilingual transcript]

Watch it live — switching between five languages in one session

This screen recording shows a real conference talk translated in real time, with the target language switched live, mid-talk, across Spanish, Italian, Turkish, Chinese and Korean — same source audio, five different spoken outputs, no restart between switches. It's the exact WASAPI loopback pipeline described above: no virtual audio cable, no subtitles, the translation spoken back in a natural voice while the original stays audible underneath.

How fast is it, really?

VoxisLive is near-simultaneous, not zero-delay: typically one to two seconds behind the original speech, depending on utterance length and network latency. For reference, professional human simultaneous interpreters work two to four seconds behind the speaker, so VoxisLive sits at the fast end of that range or ahead of it. Very short fragments are batched to avoid poor single-word translations.
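One way to picture the batching of short fragments is the sketch below. It is an illustrative guess at the idea, not the app's actual rule: the `min_ms` and `max_gap_ms` thresholds are assumptions, and segments are represented as hypothetical `(start_ms, end_ms)` pairs.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start_ms, end_ms)


def batch_short_segments(
    segments: Sequence[Segment], min_ms: int = 600, max_gap_ms: int = 300
) -> List[Segment]:
    """Merge a too-short segment into the one that follows it.

    A segment shorter than min_ms is joined with the next segment when the
    pause between them is at most max_gap_ms. A short segment with no
    nearby neighbor is still emitted on its own. Input must be sorted by
    start time.
    """
    batched: List[Segment] = []
    pending = None  # segment still waiting to see whether it should grow
    for start, end in segments:
        if pending is None:
            pending = (start, end)
            continue
        p_start, p_end = pending
        too_short = (p_end - p_start) < min_ms
        close_enough = (start - p_end) <= max_gap_ms
        if too_short and close_enough:
            pending = (p_start, end)  # extend the pending segment
        else:
            batched.append(pending)
            pending = (start, end)
    if pending is not None:
        batched.append(pending)
    return batched
```

For example, a 200 ms fragment followed 100 ms later by a 1.2 second sentence would be sent as a single 1.5 second segment, which gives the model enough context to translate the fragment sensibly.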

Meetings without a bot

VoxisLive never joins a call as a participant, never requests host permissions, and never touches the meeting app. It reads the audio already playing to your speakers, so it is invisible to other participants and works the same way across Zoom, Teams, Google Meet, Webex and Discord. In two-way mode it also translates your own speech into the meeting language through a virtual microphone.

What leaves your machine

With the open-source BYOK build, audio goes directly to Google's API under your own key — VoxisLive servers are never involved. With the managed Store app, detected speech segments are proxied to the model and no audio is retained after the session ends. Silence and non-speech audio never leave your machine at all, thanks to on-device VAD.

FAQ

Common questions

01 Do I need VB-CABLE or a virtual audio driver?

No, not to capture your PC audio. VoxisLive uses WASAPI loopback, a native Windows API available on Windows 10 and 11, so there is nothing to install or route and no new device appears in your audio settings. A virtual microphone is optional and only needed to send your translated voice back into a two-way meeting; without one, Meeting mode runs listen-only.

02 Does VoxisLive join my meeting as a bot?

Never. It captures your own system audio locally, so no third attendee appears in Zoom, Teams or Meet, no permission prompt fires, and no platform integration is needed.

03 How much delay should I expect?

Roughly one to two seconds behind the original speech, after speech detection and AI processing. Professional human simultaneous interpreters typically work two to four seconds behind the speaker, so that is the same range or faster.

04 Does translation quality depend on utterance length?

Yes — longer utterances translate better. Very short fragments are batched or deferred to avoid poor single-word translations.

Free to start · 10 free minutes every day

Hear every language, in real time.

Runs on Windows 10 and 11 — no drivers, no setup ritual, no bot in your call.

Get it on Microsoft Store · Open source on GitHub