Cooking, they say, is as much an art as a science, so it's no surprise that robots have a difficult time in the kitchen. Perhaps one day robot chefs will be as commonplace as blenders, but they will still need to learn their job. To help them, scientists at the University of Maryland and NICTA, Australia are working on ways for robots to learn how to cook by watching YouTube videos.

Cooking is an everyday task, but like walking, talking, and many other mundane things, we don't appreciate how difficult it is. Take, for example, the chef's knife. It's just an 8-in length of triangular steel, yet in the hands of a skilled cook it can replace almost every fancy gadget in the kitchen, from the food processor to the garlic press.

That's because human beings are very good at manipulating objects. The human hand is amazingly versatile and human beings are endlessly inventive. The trick is to find a way to get a robot to even remotely do what a human can do when chopping an onion or whisking an egg – not to mention something more complicated.

The traditional way engineers have handled the problem is to simplify the task. That means breaking down the job and redesigning it so it can be done with claws or pincers, creating specialized manipulators that can do one task very well and nothing else, or overwhelming the problem with universal graspers like the Versaball.

Grasps identified by the robot

But this only goes so far. There are many tasks that still require the human touch and the robot needs to learn that touch if it's going to do the same thing. In some respects, this is a mechanical problem, but in most others, it's a matter of how to teach the robot. One way is to analyze the job and directly program the machine. Another is to use motion capture gloves or hand trackers to record the needed motions, and a third way is to guide the robot directly like a teacher showing a pupil how to slice some meat.

The Maryland and NICTA team are working on a more direct approach that allows the robot to learn for itself, in this case by looking at cooking videos taken directly off the internet. The trick is to find a way for robots to study human actions, then turn them into commands that are within the ability of the machine to duplicate.

The team says that dealing with raw video isn't easy. Unlike special videos made in a lab to support an experiment, those found on YouTube and other services are unpredictable, with all sorts of scenery, backgrounds, lighting, and other complexities to sort out. This requires some sophisticated image recognition as well as techniques that allow the robot to break down the observed actions to an abstract "atomic" level. This is done by using a pair of Convolutional Neural Network (CNN) based recognition modules.

Steps in how the robot learns from videos

The key to the CNN is the artificial neuron, a mathematical function that imitates living neurons. These artificial neurons are hooked together to form an artificial neural network. For the cooking robot, these networks act like the human visual system, using layers of neurons that each look at small, overlapping regions of an image. This overlap gives the network a better representation of the image and makes its recognition tolerant of shifts and distortions in the input.
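As a rough illustration of the ideas in the paragraph above, here is a minimal Python sketch of an artificial neuron and of a small filter sliding over overlapping windows of a single row of pixels. It is not the team's code. The function names, the sigmoid activation, and the two-value "edge" filter are all assumptions chosen for the example.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of inputs squashed by a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def convolve(signal, kernel):
    """Slide the kernel along the signal one step at a time, so windows overlap."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def strongest_response(signal, kernel):
    """Keep only the largest response, wherever in the signal it occurred."""
    return max(convolve(signal, kernel))

# Hypothetical filter that responds to a dark-to-bright step.
edge = [-1, 1]
row = [0, 0, 0, 1, 1, 1, 0, 0]
shifted_row = [0, 0, 0, 0, 0, 1, 1, 1]  # the same step, two pixels to the right

print(strongest_response(row, edge), strongest_response(shifted_row, edge))
```

Both rows give the same strongest response even though the bright patch has moved, which is the kind of tolerance to shifts that makes this style of network useful on unpredictable video.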

In the case of the Maryland/NICTA system, there are two CNN visual recognition modules. One of these looks at the hands of the cook in the video and works out what sort of grasp each one is using. Meanwhile, the other determines how the hand and the object it's holding are moving, breaking down the movements, analyzing them, and deducing how the robot can use them to complete its own tasks.

The robot looks for one of six basic types of grasps and studies how they are used and change through time in a video sequence. It can then decide which manipulator, if it has several, to choose to replicate the grasp, such as a vacuum gripper to hold something firmly, or a parallel gripper for more precision. In addition, it identifies the object being grasped, such as a knife, an apple, a bowl, a salmon, or a pot of yoghurt, among others.
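To make the grasp step more concrete, the sketch below collapses a per-frame list of grasp labels into runs that show how the grasp changes over time, then picks a gripper for each run. This is only an illustration. The article does not name the six grasp types, so the labels "rest", "power", and "precision" are hypothetical, as are the function names and the mapping table.

```python
from itertools import groupby

# Hypothetical labels and mapping; the real system's grasp types are not named here.
GRIPPER_FOR_GRASP = {
    "power": "vacuum gripper",        # hold something firmly
    "precision": "parallel gripper",  # finer control
}

def grasp_segments(frame_labels):
    """Collapse per-frame labels into (label, first_frame, last_frame) runs."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((label, start, start + length - 1))
        start += length
    return segments

def choose_gripper(label, available):
    """Return the gripper mapped to this grasp if the robot has it, else None."""
    gripper = GRIPPER_FOR_GRASP.get(label)
    return gripper if gripper in available else None

labels = ["rest", "rest", "power", "power", "power", "precision"]
for label, first, last in grasp_segments(labels):
    print(first, last, label, choose_gripper(label, {"vacuum gripper", "parallel gripper"}))
```

A robot with only one manipulator would simply get `None` for any grasp it cannot reproduce, which leaves the decision about what to do next to a later planning stage.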

The next step is to determine which one of ten common cooking scenarios, such as cutting, pouring, spreading, chopping, or peeling, is being carried out. That done, the system identifies a much larger group of actions that make up the scenario, breaks them down, then determines how to duplicate them in a useful sequence called a "sentence." In this way, it can go from the chaos of a "How to debone a ham?" video to a set of useful actions that the robot can perform.
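One way to picture such a "sentence" is as an ordered list of commands built from what was recognized in each stretch of video. The sketch below is a guess at the general shape, not the researchers' representation. The `Detection` fields, the command format, and the sample frames are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Detection:
    action: str   # recognized action, e.g. "cut"
    tool: str     # what is doing the action
    target: str   # what the action is applied to

def to_sentence(detections):
    """Drop consecutive repeats and render each remaining step as a command."""
    sentence = []
    previous = None
    for d in detections:
        if d != previous:
            sentence.append(f"{d.action}({d.tool}, {d.target})")
        previous = d
    return sentence

# Hypothetical recognition results for four consecutive stretches of video.
frames = [
    Detection("grasp", "hand", "knife"),
    Detection("grasp", "hand", "knife"),
    Detection("cut", "knife", "apple"),
    Detection("cut", "knife", "apple"),
]
print(to_sentence(frames))  # ['grasp(hand, knife)', 'cut(knife, apple)']
```

Repeated detections are merged because a video shows the same action across many frames, while the robot only needs each step once and in the right order.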

The researchers say that in future they will work on refining the classifications and look at how to use the identified grasps to predict actions for a more detailed analysis as they work out the "grammar" of actions.

The results of the team's work will be presented at the 29th annual conference of the Association for the Advancement of Artificial Intelligence.
