Currently, computer search and classification of images is based on the name of the file or folder, or on features such as size and date. That's fine when the file name reflects its content, but not much good when the file has an abstract name that only holds meaning for the person who chose it. This drawback means companies in the search business, such as Google and Microsoft, are extremely interested in giving computers the ability to automatically interpret the visual content of images and video. A technique developed at the University of Granada does just that, allowing pictures to be classified automatically based on whether individuals or specific objects are present in them.
One of the difficulties the researchers faced when developing a way for a computer to recognize a person is that in many images the person is only partially visible, usually just the upper body. So although successful full-body detectors were already available, the team developed an upper-body detector covering the region between the top of the head and the upper half of the torso, seen from a near-frontal viewpoint. The researchers say this detector works well for viewpoints up to 30 degrees away from straight frontal, and also picks up back views.
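For readers who want to experiment with this kind of detection step, the short sketch below runs a head-and-shoulders detector with off-the-shelf tools. It uses OpenCV's bundled upper-body Haar cascade purely as a stand-in for the Granada team's detector, which it does not reproduce; the file names are placeholders.

```python
# Illustrative only: OpenCV's stock Haar upper-body cascade as a
# stand-in for the near-frontal upper-body detector described above.
import cv2

# haarcascade_upperbody.xml ships with OpenCV and detects roughly the
# head-and-shoulders region of near-frontal people.
cascade_path = cv2.data.haarcascades + "haarcascade_upperbody.xml"
detector = cv2.CascadeClassifier(cascade_path)

frame = cv2.imread("frame.jpg")                 # any still or video frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, w, h) box per detected upper body.
boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", frame)
```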
Strike a pose
But being able to recognize a person within a frame is nowhere near as useful as being able to tell what they are doing over a number of frames. To achieve this, the researchers set out to detect the 2D body pose of every person in every video frame. Once again, because in TV and films people are often visible only from the waist up, the team focused on six body parts: the head, the torso, and the left and right upper arms and forearms. Because they wanted to make as few assumptions as possible, the method puts no restrictions on clothing, location and scale, moving cameras or backgrounds, or arm pose. In fact, the only assumption the system currently makes is that people appear upright, that is, the orientation of the head and torso is near vertical.

The system first detects the upper body of the person in the frame, which gives the approximate location and scale of the person and roughly where the torso and head should lie. This allows the system to restrict the search area, which is then narrowed further using color models that estimate the person's appearance automatically from subregions of the detection window likely to contain the person. These models are used to initialize a segmentation algorithm, and the search area for body parts is progressively reduced, eventually yielding an estimate of the 2D pose.
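The sketch below mirrors that staged narrowing of the search space. It is a simplification under stated assumptions: the subregion coordinates are hypothetical, OpenCV's GrabCut (seeded with the detection rectangle) stands in for the color-model segmentation the article describes, and the final part search is left as a stub, since a full pose estimator is far beyond a short example.

```python
# A schematic sketch of the staged pipeline described above: detection
# window -> appearance model -> segmentation -> restricted part search.
import cv2
import numpy as np

def estimate_pose(frame, box):
    x, y, w, h = box  # upper-body detection window (x, y, width, height)

    # 1. The detection fixes location and scale: the head should sit in
    #    the upper part of the window and the torso below it
    #    (hypothetical subregions, for illustration only).
    head_prior = (x + w // 4, y, w // 2, h // 3)
    torso_prior = (x + w // 8, y + h // 3, 3 * w // 4, 2 * h // 3)

    # 2. Separate person from background. GrabCut builds foreground and
    #    background color models from the rectangle and iterates; it is
    #    a stand-in for the article's color-model segmentation.
    mask = np.zeros(frame.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame, mask, (x, y, w, h), bgd, fgd, 5,
                cv2.GC_INIT_WITH_RECT)
    person = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)

    # 3. A real system would now search for the six parts (head, torso,
    #    upper arms, forearms) only inside the segmented region; here we
    #    simply return the priors and the segmentation mask.
    return {"head": head_prior, "torso": torso_prior, "mask": person}
```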
The ability to estimate a 2D pose allows the system to retrieve shots containing a particular pose from a video database, in what the researchers call a Pose Search. It does this by comparing the spatial configuration of body parts returned by the pose estimator, using features that are independent of the person's identity, clothing, background, and lighting. With this technique the system can automatically classify video scenes in which people appear in a specific pose, and it also allows human actions such as walking, jumping, and bending down to be detected in video sequences.
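As a rough illustration of how such a search might work, the sketch below encodes a pose as a vector built from part orientations, a hypothetical descriptor that, like the features the article describes, depends only on the configuration of the body parts, and then ranks database frames by distance to the query pose.

```python
# Illustrative pose search by descriptor matching. The descriptor and
# data layout here are hypothetical, not the researchers' actual features.
import numpy as np

PARTS = ["head", "torso", "left_upper_arm", "left_forearm",
         "right_upper_arm", "right_forearm"]

def pose_descriptor(parts):
    """parts: dict mapping part name -> orientation angle in radians."""
    angles = np.array([parts[p] for p in PARTS])
    # Encode each angle as (cos, sin) so wrap-around at 2*pi is handled.
    return np.concatenate([np.cos(angles), np.sin(angles)])

def pose_search(query, database, top_k=5):
    """Return the top_k (distance, frame_id) pairs closest to the query.

    database: iterable of (frame_id, parts) pairs, one per video frame.
    """
    q = pose_descriptor(query)
    scored = [(np.linalg.norm(q - pose_descriptor(parts)), frame_id)
              for frame_id, parts in database]
    return sorted(scored)[:top_k]
```

Feeding this function a query pose from one frame and a database of per-frame part orientations would return the frames whose stick-figure configuration most resembles the query, regardless of who is in them or what they are wearing.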
The results of the research, carried out by Manuel Jesús Marín Jiménez, now at the University of Córdoba, and coordinated by Professor Nicolás Pérez de la Blanca Capilla of the Department of Computer Science and Artificial Intelligence at the University of Granada, have been presented at a number of international conferences, including the International Conference on Pattern Recognition (2006) and the Conference on Computer Vision and Pattern Recognition (2008 and 2009).