Mobile Technology

Sound-tracking headphones let you eavesdrop in multiple languages

A closer look at the prototype multi-language, multi-speaker translation headphones

As an American who has been living in Portugal and studying the language sporadically for about 18 months, I'm finally at the point where I can hold a basic conversation – if I know what subject we're focusing on. So at the supermarket checkout counter, or the bank, I'm good. Plop me in the middle of a bus station, though, and the conversations swirling around me devolve into a series of shhhzzs and ows from which I have trouble picking out even a single word, let alone the overall meaning.

That's why my ears particularly perked up when I heard about a prototype set of headphones that could actually monitor its surroundings, determine how many different people are speaking, and translate each linguistic thread pretty much in real time.

The headphone-based system, known as Spatial Speech Translation, was built from off-the-shelf components by researchers at the University of Washington, and builds on the team's previous work using headphones to isolate a single voice in a group conversation.

Senior author Shyam Gollakota from UW's Mobile Intelligence Lab told us that the device consists of a pair of Sony WH-1000XM4 noise-cancelling headphones fitted with a Sonic Presence SP15C binaural microphone. Binaural microphones capture sound much the way human ears do – from two points spaced roughly an ear-width apart.

Gollakota says that once the microphones pick up sound, the feed goes to a mobile device running neural network models in real time – in this case, a laptop powered by Apple's M2 chip. The translated audio is then played back through the headphones with a delay that can be as short as 1-2 seconds, although in testing users preferred a 3-4 second delay, since the extra processing time meant the system made fewer mistakes.
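
To make that flow concrete, here's a toy Python sketch of the capture-translate-replay loop, with the output held back until the chosen delay has elapsed. The read_binaural_frame, separate_and_translate, and play functions are hypothetical stand-ins – the UW code is open source, but nothing below reproduces its actual API.

```python
# Toy sketch of the loop described above: grab a binaural frame, split it
# into per-speaker streams, translate each one, and replay the result only
# after the chosen delay. All three helpers are hypothetical stubs.
import time

OUTPUT_DELAY_S = 3.0  # testers preferred 3-4 s over the 1-2 s minimum

def read_binaural_frame() -> bytes:
    """Stand-in for one chunk of left/right microphone audio."""
    return b"\x00" * 960  # placeholder samples

def separate_and_translate(frame: bytes) -> list[tuple[int, bytes]]:
    """Stand-in for the neural models: returns (speaker_id, translated_audio)."""
    return [(0, frame)]

def play(audio: bytes) -> None:
    """Stand-in for playback through the noise-cancelling headphones."""

def process_frame() -> None:
    captured_at = time.monotonic()
    frame = read_binaural_frame()
    for speaker_id, translated in separate_and_translate(frame):
        # Hold the output until the delay has elapsed; a longer buffer
        # gives the models more context and fewer mistakes.
        remaining = OUTPUT_DELAY_S - (time.monotonic() - captured_at)
        if remaining > 0:
            time.sleep(remaining)
        print(f"speaker {speaker_id}: playing translated audio")
        play(translated)

if __name__ == "__main__":
    process_frame()
```

In the real system this loop would run continuously for every detected speaker; the point here is simply that the delay is a tunable buffer traded against accuracy, not a fixed cost.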

The system is not only able to pick out different voices in a group conversation, but it also preserves the natural rhythms of speech, making the translated feed sound very natural. It also adapts as wearers move around a room or rotate their heads, using AI to lock in on different conversational threads.

“Our algorithms work a little like radar,” said lead study author Tuochao Chen, a UW doctoral student in the Allen School. “So it’s scanning the space in 360 degrees and constantly determining and updating whether there’s one person or six or seven.”
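
That radar analogy rests on a classic property of binaural capture: a voice off to one side reaches one microphone a fraction of a millisecond before the other, and that timing gap encodes direction. The sketch below estimates a talker's bearing from the inter-channel lag using plain cross-correlation – a textbook time-difference-of-arrival trick, not the paper's actual algorithm, and the microphone spacing is an assumed figure.

```python
# Toy direction-of-arrival estimate from binaural audio: the talker's angle
# is recovered from the arrival-time gap between the two ear microphones.
# Textbook cross-correlation, not the UW team's algorithm.
import numpy as np

SAMPLE_RATE = 16_000    # Hz
MIC_SPACING = 0.18      # metres between ear mics (assumed, roughly head-width)
SPEED_OF_SOUND = 343.0  # m/s

def estimate_bearing(left: np.ndarray, right: np.ndarray) -> float:
    """Bearing in degrees: 0 = straight ahead, positive = wearer's right."""
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)  # samples the left channel lags
    tdoa = lag / SAMPLE_RATE                       # arrival-time gap in seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

# Demo: a talker ~30 degrees to the wearer's right hits the right mic first,
# so the left channel is delayed by a few samples.
rng = np.random.default_rng(0)
signal = rng.standard_normal(4000)  # 0.25 s of noise-like "speech"
delay = int(round(MIC_SPACING * np.sin(np.radians(30)) / SPEED_OF_SOUND * SAMPLE_RATE))
left = np.concatenate([np.zeros(delay), signal])
right = np.concatenate([signal, np.zeros(delay)])
print(f"estimated bearing: {estimate_bearing(left, right):.1f} degrees")
```

Run as-is, it reports roughly 28 degrees for the simulated 30-degree source, the small error coming from rounding the delay to whole samples. Scanning the full 360 degrees and counting speakers, as Chen describes, is a much harder version of this same localization problem.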

So far, the system has been trained on conversational Spanish, French, and German, but the researchers say it could eventually handle around 100 languages. They are currently working on improving its speed and accuracy, and they've made the code that powers the system open source so that others can experiment with it.

“This is a step toward breaking down the language barriers between cultures,” concludes Chen. “So if I’m walking down the street in Mexico, even though I don’t speak Spanish, I can translate all the people’s voices and know who said what.”

You can see the breakthrough device in action in the following video.

Spatial speech translation

Source: University of Washington
