A team led by Kazuhiro Nakadai at Honda Research Institute-Japan (HRI-JP) is improving how robots process and understand sound. The robot, aptly called HEARBO (HEARing roBOt), can parse four sounds (including voices) at once, and can tell where the sounds are coming from. The system, called HARK, could allow future robot servants to better understand verbal commands from several meters away.
The HARK system (HRI-JP Audition for Robots with Kyoto University) processes incoming sound with eight microphones embedded in the robot's head. First the software isolates the noise generated by the robot's 17 motors and cancels it in real time, a process known as "ego-noise suppression." It then processes the remaining audio while applying a sound source localization algorithm that pinpoints the origin of a sound to within one degree.
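HARK's own localization algorithms are considerably more sophisticated, but the core idea of locating a sound by comparing microphone signals can be sketched with a classic time-difference-of-arrival estimate. The toy example below (illustrative only, not HARK code; the mic spacing, sample rate, and function names are all invented for the demo) uses GCC-PHAT cross-correlation between two simulated microphones to recover a bearing angle:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay of `sig` relative to `ref` via GCC-PHAT."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12          # phase transform: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Toy demo: one broadband source, two microphones 20 cm apart
fs = 16000
c = 343.0                                  # speed of sound, m/s
d = 0.2                                    # mic spacing, m
rng = np.random.default_rng(0)
src = rng.standard_normal(fs)              # 1 s of broadband "speech"
mic1 = src
mic2 = np.roll(src, 5)                     # arrives 5 samples later at mic 2
tau = gcc_phat(mic2, mic1, fs)             # estimated inter-mic delay, seconds
angle = np.degrees(np.arcsin(np.clip(tau * c / d, -1, 1)))
print(f"delay: {tau * 1e6:.0f} us, bearing: {angle:.1f} deg")
```

With more microphones (HEARBO has eight), the same idea generalizes: each mic pair yields a delay, and intersecting the resulting direction estimates narrows the bearing down, which is how sub-degree accuracy becomes plausible.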
"By using HARK, we can record and visualize, in real time, who spoke and from where in a room," explains Nakadai on the HRI-JP website. "We may be able to pick up voices of a specific person in a crowded area, or take minutes of a meeting with information on who spoke what by evolving this technology."
In one experiment, the robot took food orders from four people speaking simultaneously – and knew who had ordered what. In another experiment, the robot played a game of rock-paper-scissors with three people. Each person said either rock, paper, or scissors at the same time, and the robot was able to determine who won. Others have taught the robot what different musical instruments sound like, which could allow the robot to separate a song into various parts.
HARK allows the robot to parse up to four speakers simultaneously, as shown in this example of "verbal rock-paper-scissors"
HARK falls within a branch of artificial intelligence known as robot audition, a capability any practical robot helper will need in daily life. Honda has reportedly invested more than US$60 million in its humanoid robot, ASIMO, which it plans to commercialize one day. Earlier work by the same team was applied to the latest version of ASIMO, which can understand different words spoken by three people simultaneously.
In the first video demonstration below, HEARBO is bombarded with a beeping alarm clock, music, and a person speaking to it. Not only can it distinguish between the types of sound it is hearing, but it turns its head in the direction of the sound it is seeking. In the second demonstration, the robot listens to verbal commands while music plays. It estimates the song's tempo and dances to the rhythm, and performs ego-noise suppression to cancel out its own servo noise.
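Tempo estimation of the kind the dancing demo relies on is a research problem in its own right; a bare-bones illustration (not the team's method; every parameter here is invented for the demo) is to build an onset-strength envelope from frame-to-frame energy rises, then pick the dominant beat period from its autocorrelation:

```python
import numpy as np

def estimate_tempo(x, fs, hop=512):
    """Rough tempo estimate: frame-energy rises as onset strength,
    then autocorrelation to find the dominant beat period."""
    frames = len(x) // hop
    energy = np.array([np.sum(x[i*hop:(i+1)*hop] ** 2) for i in range(frames)])
    onset = np.maximum(0, np.diff(energy))   # rises in energy ~ note onsets
    onset = onset - onset.mean()
    ac = np.correlate(onset, onset, mode="full")[len(onset) - 1:]
    frame_rate = fs / hop
    lo = int(frame_rate * 60 / 200)          # search 60-200 BPM
    hi = int(frame_rate * 60 / 60)
    lag = lo + np.argmax(ac[lo:hi])
    return 60.0 * frame_rate / lag

# Toy demo: an 8-second click track at 120 BPM
fs = 16000
period = int(fs * 60 / 120)                  # samples between beats
impulses = np.zeros(fs * 8)
impulses[::period] = 1.0
click = np.convolve(impulses, np.hanning(64), mode="same")
tempo = estimate_tempo(click, fs)
print(f"estimated tempo: {tempo:.0f} BPM")
```

The coarse frame hop limits precision to a few BPM, which is why real beat trackers refine the period estimate and also track beat phase so a robot can step on the beat rather than merely near it.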