How VoxisLive Works — Driverless Real-Time Voice Translation on Windows

Step 1 — Audio capture, without drivers

VoxisLive uses WASAPI loopback, the same low-level Windows Audio Session API mode that screen recorders use to capture “what's playing.” It is a native Windows capability: no virtual audio cables, no driver installation, no audio routing changes. Capture taps the stream already headed to your output device, so playback itself is neither delayed nor altered and no audible artifacts are added.

Competing tools typically route audio through virtual devices like VB-CABLE, which need driver installs (often admin rights and a reboot) and can conflict with exclusive-mode audio, ASIO drivers or anti-cheat systems. VoxisLive skips that class of problem entirely.

Step 2 — On-device speech detection

The app runs on-device voice-activity detection (VAD) to separate speech from silence, background noise and music — locally, on your CPU, with no network round-trip. Only segments identified as human speech proceed to translation, which cuts latency and protects your minute balance. VAD also tracks VoxisLive's own spoken output so it never re-translates its own voice.
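To make the idea of local speech gating concrete, here is a minimal sketch of an energy-based detector. It is an illustration only, not VoxisLive's actual VAD: the function names, the 480-sample frame size, the 0.02 threshold and the three-frame hangover are all assumptions chosen for the example. A plain energy threshold also cannot tell speech from music, which a production VAD has to do.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech_segments(
    samples: Sequence[float],
    frame_size: int = 480,      # hypothetical: 30 ms at 16 kHz
    threshold: float = 0.02,    # hypothetical RMS level that counts as "loud"
    hangover_frames: int = 3,   # quiet frames tolerated inside one segment
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of regions louder than threshold."""
    segments: List[Tuple[int, int]] = []
    start = None   # sample index where the current segment began
    quiet = 0      # consecutive quiet frames seen inside the segment
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_size
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover_frames:
                # Close the segment right after its last loud frame.
                segments.append((start, (i - quiet + 1) * frame_size))
                start, quiet = None, 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, (n_frames - quiet) * frame_size))
    return segments
```

Only the sample ranges returned by a detector like this would be forwarded for translation. Everything outside them stays on the machine.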

Step 3 — One-pass simultaneous translation

Speech segments go to a multimodal real-time model that handles recognition, translation and voice synthesis in a single low-latency pass — collapsing the three sequential network calls of a traditional pipeline (speech-to-text → translation → text-to-speech) into one. Like a human interpreter in a booth, it begins translating while the speaker is still talking.

Step 4 — Spoken playback with ducking

The translated voice plays through your output device while two things happen in parallel: psychoacoustic ducking lowers the original audio while the translation speaks (mirroring professional simultaneous interpretation), and latency synchronization keeps each translation aligned with its speech segment so long sessions never drift.
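The ducking behavior can be sketched as a smoothed gain applied to the original audio. This is a simplified illustration rather than VoxisLive's implementation: it uses plain gain ducking with one-pole smoothing, with no psychoacoustic model, and the names `duck_gains` and `apply_gains` and the `duck_gain`, `attack` and `release` values are assumptions made for the example.

```python
from typing import List, Sequence


def duck_gains(
    translation_active: Sequence[bool],
    duck_gain: float = 0.25,  # hypothetical level of the original while ducked
    attack: float = 0.5,      # fraction of the gap closed per frame when ducking
    release: float = 0.1,     # fraction of the gap closed per frame when recovering
) -> List[float]:
    """Per-frame gain for the original audio.

    The gain moves toward duck_gain while the translated voice is playing
    and back toward 1.0 when it stops, so the level never jumps abruptly.
    """
    gains: List[float] = []
    gain = 1.0
    for active in translation_active:
        target = duck_gain if active else 1.0
        rate = attack if active else release
        gain += (target - gain) * rate
        gains.append(gain)
    return gains


def apply_gains(
    frames: Sequence[Sequence[float]], gains: Sequence[float]
) -> List[List[float]]:
    """Scale every sample of each frame by that frame's gain."""
    return [[s * g for s in frame] for frame, g in zip(frames, gains)]
```

A fast attack and slow release is a common choice for ducking: the original drops quickly when the translation starts and fades back in gently once it stops.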

How fast is it, really?

VoxisLive is near-simultaneous, not zero-delay: typically one to two seconds behind the original speech, depending on utterance length and network latency. For reference, professional human simultaneous interpreters work two to four seconds behind the speaker — VoxisLive operates in that range or faster. Very short fragments are batched to avoid poor single-word translations.
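One way to picture the batching of very short fragments is to merge a too-short segment into the segment that follows it when the pause between them is small. The sketch below makes that assumption explicit. The `Segment` type, the `batch_short_segments` function, and the `min_duration` and `max_gap` values are hypothetical and do not describe VoxisLive's actual rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def batch_short_segments(
    segments: Sequence[Segment],
    min_duration: float = 1.0,  # hypothetical: shorter segments are held back
    max_gap: float = 0.5,       # hypothetical: longest pause that still merges
) -> List[Segment]:
    """Merge each too-short segment into the next one if it follows closely."""
    batched: List[Segment] = []
    pending: Optional[Segment] = None  # short segment waiting for a neighbor
    for seg in segments:
        if pending is not None:
            if seg.start - pending.end <= max_gap:
                # Close enough: extend the held fragment over this segment.
                seg = Segment(pending.start, seg.end)
            else:
                # Nothing followed in time: emit the fragment on its own.
                batched.append(pending)
            pending = None
        if seg.duration < min_duration:
            pending = seg
        else:
            batched.append(seg)
    if pending is not None:
        batched.append(pending)
    return batched
```

Holding a fragment back adds a little delay for that fragment, which is the trade-off behind the one-to-two-second figure depending on utterance length.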

Meetings without a bot

VoxisLive never joins a call as a participant, never requests host permissions, and never touches the meeting app. It reads the audio already playing to your speakers, which makes it invisible to other participants and identical across Zoom, Teams, Google Meet, Webex and Discord. In two-way mode it also translates your own speech into the meeting language through a virtual microphone.

What leaves your machine

With the open-source BYOK build, audio goes directly to Google's API under your own key — VoxisLive servers are never involved. With the managed Store app, detected speech segments are proxied to the model and no audio is retained after the session ends. Silence and non-speech audio never leave your machine at all, thanks to on-device VAD.

FAQ

Common questions

01. Do I need VB-CABLE or a virtual audio driver?

No. VoxisLive uses WASAPI loopback, a native Windows API available on Windows 10 and 11. There is nothing to install or route, and no new device appears in your audio settings.

02. Does VoxisLive join my meeting as a bot?

Never. It captures your own system audio locally, so no extra attendee appears in Zoom, Teams or Meet, no permission prompt fires, and no platform integration is needed.

03. How much delay should I expect?

Roughly one to two seconds behind the original speech, including speech detection and AI processing. Professional human simultaneous interpreters typically work two to four seconds behind the speaker, so VoxisLive is in that range or faster.

04. Does translation quality depend on utterance length?

Yes — longer utterances translate better. Very short fragments are batched or deferred to avoid poor single-word translations.

From your speakers to your language in one to two seconds.