Scientists at Carnegie Mellon University's Robotics Institute (CMU RI) are working on a computer system that can read body language right down to the position of fingers. The new process works in real time and even in crowds, opening the door to a more natural way for people and machines to interact.
At the moment, communicating with computers is mostly confined to typing, mouse clicks, and screen touching. Though talking is also being added to that list, human beings don't just communicate with words. As anyone who has dealt with a guilty teenager knows, half of human communication comes from body language, and without taking that into account, interactions can become difficult and laborious.
The tricky bit is to get computers to identify human poses. These are often very subtle and include such details as the position of individual fingers, which can be obscured by objects or other people. On top of that, while large data banks of annotated facial expressions and body positions exist, there are none for hand gestures and poses.
The team led by Yaser Sheikh, associate professor of robotics at Carnegie Mellon, combined a number of approaches to solve this problem. One was to simply provide the computer with more data by having a pair of postgraduate students stand in front of a camera making thousands of different poses and gestures.
Another was to reverse the usual way a computer reads poses. Instead of looking at the whole person and working down to the gestures, the computer looked at the individual hands, arms, legs, and faces and assigned them to a person. According to the team, this was especially useful for looking at crowds.
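To make the bottom-up idea concrete, here is a minimal Python sketch, not the CMU team's actual method. It assumes that body parts have already been detected independently as labeled 2D points, and that each person is represented by a single anchor point such as the neck. The function name `group_parts_by_person`, the input layout, and the nearest-anchor rule are all illustrative assumptions.

```python
from math import hypot


def group_parts_by_person(anchors, parts):
    """Assign each detected part to a nearby person anchor (e.g. a neck point).

    anchors: list of (x, y) positions, one per person (hypothetical input).
    parts:   list of (label, x, y) detections found without knowing the person.
    Returns a list of dicts, one per anchor, mapping part label -> (x, y).
    """
    people = [{} for _ in anchors]
    # Score every (part, person) pair by distance, closest pairs first.
    candidates = sorted(
        (hypot(px - ax, py - ay), part_idx, person_idx)
        for part_idx, (_, px, py) in enumerate(parts)
        for person_idx, (ax, ay) in enumerate(anchors)
    )
    used_parts = set()
    for _, part_idx, person_idx in candidates:
        label, px, py = parts[part_idx]
        # Each detection is used once; each person gets one part per label.
        if part_idx in used_parts or label in people[person_idx]:
            continue
        people[person_idx][label] = (px, py)
        used_parts.add(part_idx)
    return people


# Two people standing side by side, with four hand detections in the frame.
anchors = [(100, 50), (300, 60)]
parts = [
    ("left_hand", 80, 120),
    ("left_hand", 285, 130),
    ("right_hand", 125, 118),
    ("right_hand", 320, 125),
]
people = group_parts_by_person(anchors, parts)
```

In this toy case each person ends up with one left hand and one right hand. A real system would need a far more robust association score than plain distance, especially when people overlap in a crowd.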
The third part was to use CMU's Panoptic Studio, which is a two-story dome embedded with 500 video cameras. This allowed the computer to study poses from hundreds of different angles at once using a large number of subjects.
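The article does not describe how the studio turns many views into labeled data, but the general appeal of a multi-camera rig can be illustrated with a small Python sketch. It assumes that the 3D position of a point such as a fingertip is already known and that every camera is a simple pinhole model that is only translated, never rotated. The camera names, positions, and focal length below are hypothetical.

```python
def project_point(point, camera_position, focal_length=1000.0):
    """Project a 3D point into a pinhole camera that looks along +z.

    Simplification: the camera is translated but not rotated.
    Returns (u, v) image coordinates, or None if the point is behind the camera.
    """
    x = point[0] - camera_position[0]
    y = point[1] - camera_position[1]
    z = point[2] - camera_position[2]
    if z <= 0:
        return None
    return (focal_length * x / z, focal_length * y / z)


def annotate_views(point, cameras):
    """Produce one 2D label per camera from a single known 3D point."""
    return {name: project_point(point, pos) for name, pos in cameras.items()}


# One fingertip position, three hypothetical cameras.
fingertip = (0.2, 0.1, 2.0)
cameras = {
    "cam_a": (0.0, 0.0, 0.0),
    "cam_b": (0.5, 0.0, 0.0),
    "cam_c": (0.0, 0.0, 3.0),  # positioned past the fingertip, so it cannot see it
}
labels = annotate_views(fingertip, cameras)
```

The point of the sketch is that a single 3D position yields a 2D label in every camera that can see it, so adding cameras multiplies the amount of annotated imagery obtained from one pose.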
"A single shot gives you 500 views of a person's hand, plus it automatically annotates the hand position," says Hanbyul Joo, a PhD student in robotics. "Hands are too small to be annotated by most of our cameras, however, so for this study we used just 31 high-definition cameras, but still were able to build a massive data set."
The team is currently working on how to make the transition from 2D models to 3D models for better recognition. The ultimate goal is to produce a system that will allow a single camera and a laptop to read the poses from a group of people.
When the technology is mature, the CMU RI team sees it as having a number of applications. Not only will it allow people to interact with machines by simply pointing, it will also help self-driving cars deduce when a pedestrian intends to step into the road, act as an automatic aid in diagnosing behavioral disorders, and track sports players on the field and interpret what they are doing.
The research will be presented at the 2017 Computer Vision and Pattern Recognition Conference in Honolulu, which runs from July 21 to 26.
Source: Carnegie Mellon University