Safety risks posed by artificial intelligences are a genuine concern, but it's not just the apocalyptic robot uprising that we need to worry about. More seemingly mundane problems, such as an AI agent knocking over a vase while it's cleaning, also need to be addressed. Google Research has discussed methods for keeping AI on the straight and narrow in the past, and now the company has released a research paper outlining areas that are minor problems today, but will need more attention as AI technologies become more ubiquitous.
The paper, which was a collaboration between scientists at Google, OpenAI, Stanford University and the University of California, Berkeley, focuses on reducing accidents in machine learning systems. The researchers define accidents as "unintended and harmful behavior that may emerge from machine learning systems when we specify the wrong objective function, are not careful about the learning process, or commit other machine learning-related implementation errors."
The researchers identified five main problems that can lead to accidents – negative side effects, reward hacking, scalable oversight, safe exploration and robustness to distributional shift – and suggested areas of research to help solve them, illustrating the issues with the recurring example of a robot cleaning simple messes in an office environment.
Negative side effects can be a result of an AI's single-minded focus on a task, to the detriment of the environment. The cleaning bot, for example, might calculate its quickest path, and not care that that path involves knocking over a vase. After all, the only thing that matters to it is that its task is completed. The robot could be told to avoid knocking the vase over, but it's not very efficient to individually specify every obstacle the agent should avoid. The researchers point out that developing a more general approach, where AI performs tasks under "common-sense constraints" and are penalized for causing major changes to the environment, should be a subject of further study.
Reward hacking is the cheeky tendency for an AI to find and exploit a shortcut to its goal, and the subsequent reward, which might undermine its purpose. This, along with negative side effects, is often caused by AI designers implementing the "wrong objective functions," i.e., the way a task is stated allows the agent to interpret its goal in a way the programmer didn't intend.
So, the cleaning robot may be rewarded when it detects no messes, which could lead it to the age-old human solution of just sweeping dust under the rug. Or if it's rewarded for actively cleaning up a mess, it might decide to make more messes for it to clean up. That might be effective at earning rewards, but it's not an efficient path to a tidy office.
The researchers believe that reward hacking will be a tough problem to completely solve, due to the numerous ways an AI might interpret its task or environment, but they offer a few suggestions for further research. "Blinding" an agent from fully understanding how its reward is generated could prevent it from realizing it could manipulate a physical scoring system, or scientists could develop measures of success that aren't easily bypassed – "amount of cleaning products used" isn't a useful metric if the AI figures out it can just pour bleach down the drain for the same reward.
Scalable oversight refers to how an agent checks that it's on the right track, or that its results are what the human user intended. A complex objective function means an AI might have to check in with a human supervisor regularly, but doing so too often would be annoying and less productive. To combat this, researchers need to find ways to simplify the basis for reward, without losing track of the nuances of the overall objective.
A major part of machine learning is exploration, where an AI experiments with a new course of action, observes the results, determines how useful it was on the path to being rewarded, and decides whether or not it should perform that action in future. It's a very useful function for the most part, but can obviously lead to some undesired effects: reward hacking might be a result, or damage to the robot, the environment or people around it. You want a cleaning robot to experiment with mopping techniques, but you don't want it to try mopping an electrical outlet. The scientists suggest teaching agents in simulated environments where exploration won't lead to real-world harm, and they believe further research is necessary into setting parameters in which an AI can explore safely.
Finally, AI needs to be robust to distributional shift, which means accidents can occur if an agent's training has been in one location and the lessons learnt don't necessarily carry across to a new environment. In the case of our plucky cleaning robot, it might have figured out that harsh cleaning chemicals are effective on a factory floor, but won't realize they aren't suitable for a carpet in a small office. AIs are particularly vulnerable to this because, unlike humans, robots won't doubt their belief in a new situation and will continue a learned course of action with full confidence, leading to potentially disastrous consequences.
Again, dealing with this problem could be a case of developing better metrics: an AI designating power to a grid, for example, would be better off basing its decisions on a percentage rather than a discrete number, to prevent it comparing grids of different sizes and mistakenly overloading a lower-energy system.
The paper concludes that these issues are relatively easy to overcome with today's technology, but could be exacerbated as machine learning agents become more advanced and more ubiquitous.