Researchers from the University of Washington (UW) have developed a prototype shape-changing smart speaker made up of small robots, designed to divide rooms into speech zones and track the location of individual speakers.
Deep-learning algorithms were used to train the machines to identify sound and mute certain areas or separate simultaneous conversations, with the researchers claiming this is possible even when two or more speakers have similar voices.
Each robot in the fleet is about an inch in diameter, is equipped with a microphone, and can automatically deploy from, and then return to, a charging station. This allows the system to be moved between environments and to set itself up automatically.
The seven-bot system is intended to replace a single central microphone in, for example, a conference room meeting, and to enable better control of in-room audio.
“If I close my eyes and there are 10 people talking in a room, I have no idea who’s saying what and where they are in the room exactly,” said co-lead author Malek Itani, a UW doctoral student in the Paul G. Allen School of Computer Science & Engineering.
“That’s extremely hard for the human brain to process. Until now, it’s also been difficult for technology.
“For the first time, using what we’re calling a robotic ‘acoustic swarm,’ we’re able to track the positions of multiple people talking in a room and separate their speech.”
According to the team, the UW system is the first to accurately distribute a robot swarm using only sound.
Previous similar research on swarms has attempted to do this using overhead or on-device cameras, projectors or special surfaces, the researchers explained.
On questions of privacy, those behind the project suggested it could actually benefit privacy, rather than pose a risk.
Itani explained that the swarm could be used to create ‘sound bubbles’ of a certain size in the future, expanding on the current capabilities of smart speakers. These ‘mute zones’ could enable better privacy, he added.
What’s more, automatic deployment means the robots can position themselves throughout a room for maximum accuracy, spreading as far apart from one another as possible to make differentiating and locating speakers easier.
Current smart speakers operate differently: they cluster multiple microphones within a single device, too close together to allow for the creation of mute and active zones. With the robots spread across a room, by contrast, each voice reaches each microphone at a slightly different time.
“We developed neural networks that use these time-delayed signals to separate what each person is saying and track their positions in a space,” added co-lead author Tuochao Chen, a UW doctoral student in the Allen School.
“So you can have four people having two conversations and isolate any of the four voices and locate each of the voices in a room.”
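To illustrate the idea of time-delayed signals, the short Python sketch below estimates how much later a sound arrives at one microphone than at another. It is not the researchers' method, which relies on neural networks. It uses plain cross-correlation on synthetic noise, and the sample rate, function name and signal values are all assumptions made for the example.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at room temperature
SAMPLE_RATE_HZ = 16_000     # assumed sample rate, not taken from the article


def estimate_delay_samples(ref, other):
    """Return by how many samples `other` lags behind `ref` (negative if it leads)."""
    corr = np.correlate(other, ref, mode="full")
    # Index len(ref) - 1 of the full correlation corresponds to zero lag.
    return int(np.argmax(corr)) - (len(ref) - 1)


# Synthetic stand-in for speech: the same noise burst reaches
# microphone B 14 samples after it reaches microphone A.
rng = np.random.default_rng(0)
source = rng.standard_normal(2000)
true_delay = 14
mic_a = source
mic_b = np.concatenate([np.zeros(true_delay), source[:-true_delay]])

delay = estimate_delay_samples(mic_a, mic_b)
path_difference_m = delay / SAMPLE_RATE_HZ * SPEED_OF_SOUND_M_S
print(f"delay: {delay} samples, extra path to mic B: {path_difference_m:.2f} m")
```

At the assumed 16 kHz sample rate, a 14-sample delay corresponds to roughly 0.3 m of extra distance travelled. Delays of this kind, measured across several well-separated microphones, are what allow a source's position to be narrowed down.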
Some of the test environments for the swarm included offices, living rooms and kitchens with groups of three to five people speaking.
The findings of these tests show that, across all these environments, the system could discern different voices within 1.6ft (50cm) of each other 90% of the time, without prior information about the number of speakers.
The system was reportedly able to process three seconds of audio in 1.82 seconds on average, which the researchers said is fast enough for live streaming but too slow for real-time communications, such as digital meetings.
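The arithmetic behind that distinction can be sketched in a few lines of Python. The two timing figures are the reported ones. The assumption that a full three-second chunk must be recorded before processing begins is ours, for illustration only, and is not a detail confirmed by the researchers.

```python
CHUNK_SECONDS = 3.0        # audio per chunk, from the reported figure
PROCESSING_SECONDS = 1.82  # average processing time per chunk, from the reported figure

# Real-time factor: below 1.0 means processing keeps pace with incoming audio.
real_time_factor = PROCESSING_SECONDS / CHUNK_SECONDS

# Assumption: a full chunk must be recorded before processing starts, so the
# first sample of a chunk is only available after recording plus processing.
worst_case_delay_seconds = CHUNK_SECONDS + PROCESSING_SECONDS

print(f"real-time factor: {real_time_factor:.2f}")
print(f"worst-case delay: {worst_case_delay_seconds:.2f} s")
```

A real-time factor below 1.0 means the system does not fall behind a continuous stream, which suits live streaming. Under the chunking assumption, however, a listener could wait several seconds for a given word, far longer than a two-way conversation can comfortably tolerate.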