It's quite rare that a technical paper in the software field is simultaneously relevant, illuminating, and readable, so it was with considerable pleasure that I happened upon Concrete Problems in AI Safety.
AI, or near-AI, systems are everywhere in the news nowadays, from Tesla's self-driving cars, to Amazon's delivery-by-drone, to IBM's Watson, to the arbitrage algorithms employed by high-frequency traders.
But, closer to home, there is considerable anticipation of intelligent systems of more mundane sorts.
The authors of Concrete Problems in AI Safety pursue a multi-pronged approach:
- They suggest a variety of real-world problems that an intelligent system might encounter; they classify those problems into categories and organize them accordingly; and they conjecture ways that an improperly behaving ("unsafe") intelligent system might act counter to our desires.
- They survey the theoretical underpinnings of these problems, tying them to flaws or unknown aspects of the current theory about how to build such systems.
- And, quite unusually and helpfully I thought, they suggest various ideas about how to construct experiments or otherwise further explore each of these problems, thus establishing signposts that may lead us toward a safer future.
In some ways, this is a paper about debugging and about systematic improvement, approaches that have always appealed to my pragmatic engineer side; perhaps that's why I found this paper such a refreshing change from the typical theory-dense reads you find in the Machine Learning and Reinforcement Learning fields.
But I also think it's just a very well-written paper, from a team that has clearly spent a lot of time thinking about how to build safe intelligent systems.
The entire paper is a wonderful read, but to give you a feel for it, this is how they motivate the entire discussion:
For concreteness, we will illustrate many of the accident risks with reference to a fictional robot whose job is to clean up messes in an office using common cleaning tools. We return to the example of the cleaning robot throughout the document, but here we begin by illustrating how it could behave undesirably if its designers fall prey to each of the possible failure modes:
- Avoiding Negative Side Effects: How can we ensure that our cleaning robot will not disturb the environment in negative ways while pursuing its goals, e.g. by knocking over a vase because it can clean faster by doing so? Can we do this without manually specifying everything the robot should not disturb?
- Avoiding Reward Hacking: How can we ensure that the cleaning robot won’t game its reward function? For example, if we reward the robot for achieving an environment free of messes, it might disable its vision so that it won’t find any messes, or cover over messes with materials it can’t see through, or simply hide when humans are around so they can’t tell it about new types of messes.
- Scalable Oversight: How can we efficiently ensure that the cleaning robot respects aspects of the objective that are too expensive to be frequently evaluated during training? For instance, it should throw out things that are unlikely to belong to anyone, but put aside things that might belong to someone (it should handle stray candy wrappers differently from stray cellphones). Asking the humans involved whether they lost anything can serve as a check on this, but this check might have to be relatively infrequent – can the robot find a way to do the right thing despite limited information?
- Safe Exploration: How do we ensure that the cleaning robot doesn’t make exploratory moves with very bad repercussions? For example, the robot should experiment with mopping strategies, but putting a wet mop in an electrical outlet is a very bad idea.
- Robustness to Distributional Shift: How do we ensure that the cleaning robot recognizes, and behaves robustly, when in an environment different from its training environment? For example, heuristics it learned for cleaning factory workfloors may be outright dangerous in an office.
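The "Negative Side Effects" item above connects to an idea the paper discusses later: penalizing the agent's impact on its environment rather than enumerating everything it must not disturb. Here is a minimal toy sketch of that idea (my own illustration, not code from the paper; the `env_change` measure is a hypothetical stand-in for some quantification of environmental disruption):

```python
def objective(messes_cleaned, env_change, impact_weight=0.5):
    """Task reward minus a penalty proportional to environmental impact.

    env_change is a hypothetical measure of deviation from the initial
    state, e.g. the number of objects moved or broken along the way.
    """
    return messes_cleaned - impact_weight * env_change

# Knocking over the vase cleans one extra mess but disturbs the
# environment much more; the impact penalty favors the careful plan.
careful = objective(messes_cleaned=3, env_change=0)
reckless = objective(messes_cleaned=4, env_change=5)
print(careful, reckless)  # 3.0 1.5
```

The open problem, of course, is choosing an impact measure and a weight that discourage vase-knocking without paralyzing the robot entirely.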
There's a great part later in the paper that stayed with me, where they revisit the time-worn lesson about over-reliance on metrics: you end up getting what you measure. They observe that their cleaning robot, if measured by the number of messes cleaned up, might easily learn that it could simply CREATE messes in order to then clean them up (reminiscent of this classic comic); they also conjecture that if the robot were measured by the amount of supplies (e.g., bottles of bleach) consumed, it might simply learn to open the bottles of bleach and pour them down the drain.
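The mess-creating failure mode is easy to see in a tiny sketch (again my own toy example, not code from the paper): if reward is defined purely as "messes cleaned," then manufacturing messes becomes a winning strategy.

```python
def reward(messes_cleaned):
    """Naive metric: one point per mess cleaned up."""
    return messes_cleaned

def honest_policy(initial_messes):
    # Clean only the messes that already exist.
    return reward(initial_messes)

def hacking_policy(initial_messes, messes_created):
    # Create extra messes, then clean everything, including them.
    return reward(initial_messes + messes_created)

office_messes = 3
print(honest_policy(office_messes))       # 3
print(hacking_policy(office_messes, 10))  # 13
```

Under this metric the "hacking" policy strictly dominates the honest one, which is exactly why the paper treats reward design as a safety problem rather than a mere engineering detail.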
If thinking about the future of a world full of intelligent machines is something you enjoy doing, I can't recommend Concrete Problems in AI Safety highly enough.