What is speech-to-speech translation?
Speech-to-speech translation, sometimes called spoken translation or S2ST, takes audio in one language and returns audio in another. The input is a voice; the output is also a voice. That is the crucial difference from the translation most people know, where you type text and read the result, or scan a line of subtitles under a video.
A complete S2ST pipeline does three things: recognizes the spoken words, translates their meaning, and synthesizes a new voice that says them aloud. VoxisLive runs this chain continuously, so the result lands in your ears a few seconds after the original speaker — close enough to follow a conversation, a lecture or a scene as it happens. Your eyes stay on the gameplay, the slides, or the person talking.
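The three stages above can be sketched as a simple chain of functions. This is a conceptual illustration only, not VoxisLive's code: the class name `S2STPipeline` and the stub stages are hypothetical, and a real system may fuse the stages inside one model rather than calling them one after another.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class S2STPipeline:
    """Hypothetical three-stage chain: recognize, translate, synthesize."""
    recognize: Callable[[bytes], str]    # audio -> source-language text
    translate: Callable[[str], str]      # source text -> target-language text
    synthesize: Callable[[str], bytes]   # target text -> audio

    def run(self, audio: bytes) -> bytes:
        text = self.recognize(audio)
        translated = self.translate(text)
        return self.synthesize(translated)


# Stub stages so the sketch runs: "audio" here is just UTF-8 bytes,
# and the "translator" is a one-entry dictionary.
toy = S2STPipeline(
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text: {"hola": "hello"}.get(text, text),
    synthesize=lambda text: text.encode("utf-8"),
)

result = toy.run(b"hola")
```

Each stage is swappable, which is the point of the sketch: the input and output of the whole chain are both audio, whatever sits in between.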
Why hearing beats reading
Subtitles work, but they cost you something: pulling your gaze to the bottom of the screen, parsing text, snapping back — many times a minute. In a fast scene you miss the visuals; in a game you miss the moment; in a meeting you stop watching the speaker. Reading is a second task layered on top of watching.
Spoken translation removes that task. The language you understand simply arrives through your speakers while everything you see stays in front of you — the same reason live events use interpreters speaking into an earpiece rather than scrolling a transcript on a wall. Captions still exist if you want them: export a TXT, SRT or VTT transcript after any session. They're a record, not the thing you rely on in the moment.
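For readers curious what the SRT and VTT exports contain, here is a minimal sketch of how timed cues map to each format. The function names and the `(start, end, text)` cue tuples are assumptions made for illustration; only the format details (a comma before the milliseconds in SRT, a period and a `WEBVTT` header in VTT) come from the formats themselves.

```python
from typing import Iterable, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, text)


def format_timestamp(seconds: float, sep: str) -> str:
    """Render seconds as HH:MM:SS<sep>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(cues: Iterable[Cue]) -> str:
    """SRT: numbered blocks, comma before milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(cues: Iterable[Cue]) -> str:
    """WebVTT: 'WEBVTT' header, period before milliseconds."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)
```

A plain TXT export would simply be the cue texts joined by newlines, with the timing dropped.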
A native simultaneous interpreter
VoxisLive captures system audio with Windows WASAPI process-loopback — no virtual cable, no extra driver, no bot in your call — and excludes its own output so it never translates the voice it just produced.
The captured speech goes to a native simultaneous interpreter model: rather than waiting for a sentence to finish, it begins translating while the speaker is still talking and stays a few seconds behind — exactly how a human interpreter works in a conference booth. That short, steady lag is what makes the output feel live instead of stop-and-start.
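One way to picture "start early, stay a fixed distance behind" is the wait-k policy from the simultaneous-translation literature: read k words, then emit one output for each further word read. The sketch below is a toy illustration of that policy, not a description of the model VoxisLive uses. The function `wait_k_events` and the word-for-word `translate_word` callback are hypothetical; real interpreting is not word-for-word.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, Tuple

Event = Tuple[str, str]  # ("read", source_word) or ("write", output_word)


def wait_k_events(
    source: Iterable[str],
    k: int,
    translate_word: Callable[[str], str],
) -> Iterator[Event]:
    """Interleave reads and writes so output trails input by k words."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pending: Deque[str] = deque()
    for word in source:
        yield ("read", word)
        pending.append(word)
        # Once k words are buffered, emit the oldest one per new word read.
        if len(pending) >= k:
            yield ("write", translate_word(pending.popleft()))
    # The speaker has stopped: flush whatever is still buffered.
    while pending:
        yield ("write", translate_word(pending.popleft()))


events = list(wait_k_events(["a", "b", "c"], k=2, translate_word=str.upper))
```

With k=2 the first write happens right after the second read, and writes then alternate with reads, which is the steady lag described above. A larger k gives the translator more context at the cost of a longer delay.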
Two modes
In Video / Game mode, translation is one-way: the other speaker's voice is translated into your language, and the original audio is ducked (lowered in volume) so the spoken translation sits clearly on top. In Meeting mode, it is two-way: the other party is translated into your language, and your own speech is translated into theirs and fed into a virtual microphone, with no bot in the participant list.
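Ducking in Video / Game mode can be illustrated with a few lines of sample arithmetic. This is a sketch under stated assumptions: samples are floats in the range -1.0 to 1.0, the function name `duck_and_mix` and the default `duck_gain` of 0.25 are invented for the example, and a real mixer would fade the gain in and out rather than switch it abruptly.

```python
from typing import List, Sequence


def duck_and_mix(
    original: Sequence[float],
    translation: Sequence[float],
    duck_gain: float = 0.25,
) -> List[float]:
    """Lower the original while the translation plays, then sum the two."""
    length = max(len(original), len(translation))
    mixed: List[float] = []
    for i in range(length):
        orig_sample = original[i] if i < len(original) else 0.0
        if i < len(translation):
            # Translation is active: duck the original underneath it.
            sample = orig_sample * duck_gain + translation[i]
        else:
            # Translation has ended: original returns to full volume.
            sample = orig_sample
        # Clip to the valid sample range to avoid overflow distortion.
        mixed.append(max(-1.0, min(1.0, sample)))
    return mixed


mixed = duck_and_mix([0.8, 0.8, 0.8], [0.5, 0.5])
```

In this example the first two samples carry the translation over a quieter original, and the third is the original alone at full level.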
VoxisLive runs on Windows 10 and 11 and speaks 79 target languages. Get it on the Microsoft Store, or run the free open-source build from GitHub with your own key.