Step 1 — Audio capture, without drivers
VoxisLive uses WASAPI loopback, the same low-level Windows Audio Session API that screen recorders use to capture “what's playing.” It is a native Windows capability: no virtual audio cables, no driver installation, no audio routing changes. Loopback taps the stream the output device is already rendering, so capture does not delay playback and adds no audible artifacts.
Competing tools typically route audio through virtual devices like VB-CABLE, which require a driver install (often with admin rights and a reboot) and can conflict with exclusive-mode audio, ASIO drivers or anti-cheat systems. VoxisLive skips that class of problem entirely.
Step 2 — On-device speech detection
The app runs on-device voice-activity detection (VAD) to separate speech from silence, background noise and music — locally, on your CPU, with no network round-trip. Only segments identified as human speech proceed to translation, which cuts latency and protects your minute balance. VAD also tracks VoxisLive's own spoken output so it never re-translates its own voice.
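To make the idea concrete, here is a minimal sketch of voice-activity detection using a plain energy threshold. It is not VoxisLive's detector: a threshold on loudness cannot tell speech from music, and the function names, frame length, threshold and hangover values are all assumptions chosen for illustration.

```python
import math

def frame_rms(frame):
    """Root-mean-square level of one frame of float samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_speech(samples, frame_len=160, threshold=0.02, hangover=2):
    """Return (start, end) sample ranges whose energy reaches the threshold.

    All parameter values are illustrative assumptions. `hangover` keeps a
    segment open across a few quiet frames so short pauses do not split it.
    """
    segments = []
    start = None   # sample index where the current segment began
    quiet = 0      # consecutive quiet frames seen inside a segment
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i * frame_len
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                # Close the segment at the end of the last loud frame.
                segments.append((start, (i - quiet + 1) * frame_len))
                start = None
                quiet = 0
    if start is not None:
        # Input ended while a segment was still open.
        segments.append((start, (n_frames - quiet) * frame_len))
    return segments
```

With 16 kHz audio, a frame of 160 samples covers 10 ms. Only the returned ranges would be passed on to the next step; everything else is dropped locally.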
Step 3 — One-pass simultaneous translation
Speech segments go to a multimodal real-time model that handles recognition, translation and voice synthesis in a single low-latency pass — collapsing the three sequential network calls of a traditional pipeline (speech-to-text → translation → text-to-speech) into one. Like a human interpreter in a booth, it begins translating while the speaker is still talking.
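The difference between the two designs can be sketched with Python generators. This is a toy model, not the real service: words stand in for audio chunks, upper-casing stands in for recognition, translation and synthesis, and the cascaded version is simplified to wait for the whole utterance before it produces anything. All function names are hypothetical.

```python
def one_pass(chunks):
    """Single-pass model: emits output for each chunk as soon as it arrives."""
    for chunk in chunks:
        yield chunk.upper()  # placeholder for recognize + translate + speak

def cascaded(chunks):
    """Simplified traditional pipeline: each stage waits for the previous one."""
    text = list(chunks)                      # speech-to-text on the full utterance
    translated = [t.upper() for t in text]   # translation (placeholder)
    yield from translated                    # text-to-speech

def chunks_heard_before_first_output(pipeline, words):
    """Count how many input chunks a pipeline consumes before it first speaks."""
    heard = 0

    def source():
        nonlocal heard
        for w in words:
            heard += 1
            yield w

    next(pipeline(source()))  # pull only the first piece of output
    return heard

words = ["buenos", "dias", "a", "todos"]
print(chunks_heard_before_first_output(one_pass, words))   # 1
print(chunks_heard_before_first_output(cascaded, words))   # 4
```

The one-pass version starts speaking after the first chunk, while the cascaded version has to hear the entire utterance first.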
Step 4 — Spoken playback with ducking
The translated voice plays through your output device while two things happen in parallel: psychoacoustic ducking lowers the original audio while the translation speaks (mirroring professional simultaneous interpretation), and latency synchronization keeps each translation aligned with its speech segment so long sessions never drift.
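Ducking itself is simple to sketch: lower the gain on the original audio while the translation is speaking, and ramp the change so it does not click. The example below is plain gain ducking under stated assumptions, not VoxisLive's psychoacoustic model; the function name, the -12 dB depth and the smoothing step are illustrative values only.

```python
def duck(original, translation_active, duck_db=-12.0, step=0.25):
    """Attenuate `original` wherever `translation_active` is True.

    On every sample the gain moves `step` of the remaining distance toward
    its target, so volume changes ramp smoothly instead of jumping.
    The depth and step values are assumptions for illustration.
    """
    floor = 10 ** (duck_db / 20.0)  # -12 dB is roughly 0.25 linear gain
    gain = 1.0
    out = []
    for sample, active in zip(original, translation_active):
        target = floor if active else 1.0
        gain += (target - gain) * step
        out.append(sample * gain)
    return out
```

A real mixer would use a much slower ramp, measured in tens of milliseconds rather than a handful of samples, but the shape is the same: full volume, a glide down while the translation speaks, and a glide back up afterwards.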
