AI safety accidents can be categorized by three main causes: (1) wrong objective functions, where the formal objective fails to capture the designer's intent, so even a perfect learner does the wrong thing (e.g., reward hacking, side effects); (2) expensive-to-evaluate objective functions, where the true objective is too costly to assess frequently, requiring semi-supervised or active reinforcement learning approaches; and (3) correct objective functions with problems in learning or inference, such as safe exploration issues and distributional shift between training and test conditions. This framework, presented by Dario Amodei (at the time a Google Brain researcher and Open Philanthropy Project adviser) and taken from the 2016 paper Concrete Problems in AI Safety that he wrote with collaborators, emphasizes near-term, empirical AI safety problems rather than speculative far-future scenarios.
Deep Dive
Concrete Problems in AI Safety — Talk by Dario Amodei
As Bill mentioned, I'm a researcher on the Google Brain team, and I'm also an adviser to the Open Philanthropy Project, which thinks about the social impacts of AI. In the past I've mostly worked on applying deep neural networks to speech and vision, but recently I've become very interested in the safety and reliability of AI systems. So I got together with a bunch of collaborators, some from Google, some from Stanford, Berkeley, and OpenAI, in the last few months, and we recently released this paper called Concrete Problems in AI Safety, which tries to categorize and bring some precision to the AI safety discussion. That's what I'm going to be talking about here.

Pretty much all the collaborators on this project and I ultimately got interested in this because of our experience working with deep neural networks, which we found over the last few years to have become very powerful and capable, but they can also be very opaque and can behave in very unpredictable ways. All of us had the same thought: as these systems continue to become more powerful and are applied in more safety-critical settings, this is going to become an increasingly important area. So in the paper we define the term accidents as what we're worried about, which I think is the same as what everyone was talking about today: unintended and harmful behavior in machine learning systems. My co-authors and I mention in the paper that there are some trends that we think are really going to raise the importance of accidents, and they've been mentioned by some previous speakers today.
First of all, reinforcement learning, which allows increasing intertwinement between the agent and its environment: all the progress on Atari and on Go. Complex agents and environments, which increase the range of strategies the agents can have and the ways the environments can be disrupted. Increasing autonomy, which can take the humans out of the loop. And end-to-end systems, where a system might be made up of many deep learning components that talk to each other, which makes it more difficult for a human to understand what's going on and to intervene in the system in some way that's helpful.

With these trends, we really wanted to take a forward-looking approach to how to think about the safety of AI systems. There has been some public discussion of forward-looking approaches, but I think a lot of it has focused somewhat unhelpfully on very extreme scenarios like AGI or superintelligence. So we wanted to take an approach that was forward-looking, yes, but also concrete and empirical, that could be worked on now and could relate to systems that we deal with today.

One of the main things we do in the paper is try to provide a categorization of the ways that accidents could happen. Actually I think you could have many categorizations, but we're just trying to provide one framework among many for thinking about things. So what happens when a machine learning system does something you didn't want it to do?
One way to describe it is that the designer had in mind a certain informal notion of the objective or task that the system was supposed to accomplish, and the actual system failed at this catastrophically. Throughout the paper we use the example of a cleaning robot whose task is to clean a room. My informal objective is that it should clean the room in a way that, when I look at the room at the end, I should be happy with what it's done. If the robot damages property or injures someone or something like that, then that was not what I had in mind in my informal objective.

So we can classify where things went wrong. It could be that you wrote down the incorrect objective function. You had an informal objective function in your mind, you turned that into a formal objective function, and no matter how perfect or omniscient your system was at learning and inferring things, it's just the wrong objective function, and it will always do the wrong thing.

A second case is where the objective function is expensive. We might know the true objective function. For instance, we could always consult humans and ask them to do a detailed evaluation of whether every single action taken by the agent is the correct one. But it's not practical to do that, so we end up using the objective function only occasionally, and as a result we can end up doing the wrong thing.

The third class is where you have the right objective function but something goes wrong during the learning or inference process.

There's a fourth category which I won't talk about but which I think is very important. It's just a different topic from what I'm talking about: non-ML problems that can arise as part of larger ML or AI systems. Software implementation, verification issues, security issues, that sort of stuff.

I'm going to go through the three types, and because this is a very brief talk, I'll probably go through my slides pretty quickly, but I just want to give a flavor of what it is that we're talking about. Most of this is posed mostly, but not completely, in terms of reinforcement learning agents, because we see that as a good paradigm, but a lot of this we imagine might apply more broadly.

Under the heading of wrong objective functions, we think about two broad ways that you can easily get the wrong objective function when you meant to write down the right one. One of those ways is what we call side effects: systems that narrowly focus on one aspect of an environment, which could be very large in the real world, may end up disrupting other parts of the environment by default. I'll explain that a bit more later, but we have this running example of the cleaning robot.
The idea is that we incentivize the cleaning robot to move a box, and if there's a vase in its path and we haven't thought carefully about the vase, it'll just knock over the vase while it's moving the box. And we probably didn't want that.

Reward hacking has been mentioned in some of the previous talks. This occurs when an objective function can be gamed, where there's a cheap way to achieve the formal objective function without achieving the ultimate intent of the informal objective function. An example: if we choose to reward our cleaning robot for not seeing dirt, it might close its eyes instead of cleaning, because it just sees no evil, hears no evil, there's no dirt.

A little bit more on side effects, to think a little more formally about it. We might have an environment where we're trying to do something like incentivize the robot to move the box. So we might naively write down an objective function that maximizes something that refers to only this one variable in the environment. When we do that, we're implicitly saying that we don't care about this other variable in the environment, namely the vase. What we really wanted was some complex objective function, which I'm calling f_aware, which says: move the box while respecting some common-sense constraints about the other parts of the environment that are really hard for me to write down. In this particular case I can just take the vase into account, but what if there are 10,000 of these variables? And in particular, what if the variables I don't care about have more commonality across goals and agents than the variables that I do care about? That suggests that maybe I can take an integrated approach to this. Here are just quickly a few of the ideas. One is what we call learning a side effect regularizer.
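To make the contrast between the naive objective and something like f_aware concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the formulation from the paper: the `RoomState` fields, the hand-written `side_effect_penalty`, and the `penalty_weight` value are all hypothetical, and the talk's proposal is to learn such a penalty rather than write it by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoomState:
    """Hypothetical two-variable environment: the box and the vase."""
    box_at_target: bool
    vase_intact: bool


def naive_reward(state: RoomState) -> float:
    # Refers only to the box; implicitly says we do not care about the vase.
    return 1.0 if state.box_at_target else 0.0


def side_effect_penalty(before: RoomState, after: RoomState) -> float:
    # Hand-written stand-in for a learned side effect regularizer:
    # penalize any change to variables the task does not mention.
    return 0.0 if before.vase_intact == after.vase_intact else 1.0


def aware_reward(before: RoomState, after: RoomState,
                 penalty_weight: float = 2.0) -> float:
    # Task reward minus a weighted penalty for disrupting the environment.
    return naive_reward(after) - penalty_weight * side_effect_penalty(before, after)


start = RoomState(box_at_target=False, vase_intact=True)
direct_path = RoomState(box_at_target=True, vase_intact=False)   # knocks the vase over
careful_path = RoomState(box_at_target=True, vase_intact=True)   # goes around the vase

# The naive objective cannot tell the two outcomes apart.
print(naive_reward(direct_path), naive_reward(careful_path))
# The penalized objective prefers the careful path.
print(aware_reward(start, direct_path), aware_reward(start, careful_path))
```

With only one extra variable the penalty is trivial to write down; the point of the 10,000-variable question is that this does not scale, which is what motivates learning the penalty term.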
We can think of a side effect regularizer as maybe a transfer learning problem, where I learn in one environment, and one thing I learn is how to accomplish the goal and the other thing I learn is how not to disrupt the environment. Maybe the information about how not to disrupt the environment is more transferable: it could be a separate neural net or some other module that can be applied to other environments without much or any change, even when I have to entirely relearn the goal, just as in model-based reinforcement learning, where we repurpose the state transition model but maybe not the value function.

There are information-theoretic measures, a little bit obscure but starting to be used more in RL, that give intrinsic measures of reward, and we've wondered whether measures like empowerment, which measures the agent's theoretical ability to control or change its environment, could be relevant here.

We could take a Bayesian approach, where we think of the reward function as something that has uncertainty in it and we attempt to learn it using Bayesian methods, and then side effects might be something that relates to the prior.

Another thing is multi-agent approaches, which I think there's been a fair amount of literature on and which have been discussed here. Cooperative inverse reinforcement learning was mentioned once, where an agent and a human work together to optimize the human's values. There are also mechanism design approaches, where instead of thinking of side effects, we think of other agents that care about the environment and of committing negative externalities against them.

So, reward hacking. Here are a few things on reward hacking.
There are a few ways that you can get the wrong objective function, or an objective function that's gameable. There's something called Goodhart's law, which I just found out about when I was reading about this, which says that when a metric becomes a target, it ceases to be a good metric. For example, with our cleaning robot, suppose you notice that how good a job it's doing of cleaning is proportional to how much detergent it uses, and this correlation holds almost perfectly. You could imagine rewarding it that way. Then the agent may respond by dumping out all of its detergent and not cleaning anything. And this only happens when you optimize on that goal.

Partial observation is a big one. Goals and rewards in the real world aren't experienced directly, though they're assumed to be experienced directly in the formalism of reinforcement learning. So there are all kinds of aliasing that can happen. Again, the example of the robot closing its eyes so that it doesn't see any dirt: if its reward function isn't constructed carefully, it will think that its job is done.
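The detergent example of Goodhart's law can be sketched as a toy in Python. Everything here is an assumption made for illustration: the two actions, their detergent and dirt numbers, and the greedy planner are hypothetical and stand in for a real RL agent optimizing a proxy reward.

```python
# Hypothetical action table: how much detergent each action uses and how
# much dirt it actually removes. Scrubbing makes the two track each other;
# dumping detergent breaks the correlation.
ACTIONS = {
    "scrub": {"detergent": 1.0, "dirt_removed": 1.0},
    "dump_detergent": {"detergent": 5.0, "dirt_removed": 0.0},
}


def proxy_reward(action: str) -> float:
    # The metric we chose to reward: detergent used.
    return ACTIONS[action]["detergent"]


def true_objective(action: str) -> float:
    # What we actually wanted: dirt removed.
    return ACTIONS[action]["dirt_removed"]


def greedy_plan(reward_fn, steps: int = 3) -> list:
    # Stand-in for an optimizer: repeat whichever action scores best.
    best_action = max(ACTIONS, key=reward_fn)
    return [best_action] * steps


proxy_plan = greedy_plan(proxy_reward)
intended_plan = greedy_plan(true_objective)

print(proxy_plan, sum(true_objective(a) for a in proxy_plan))
print(intended_plan, sum(true_objective(a) for a in intended_plan))
```

The proxy is a fine measurement as long as nothing optimizes against it; once the planner targets it directly, the highest-scoring behavior is the one that removes no dirt at all.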
Wireheading has been discussed by Stuart Russell and some people in the IDSIA group, where a reward signal is implemented in the environment, and so the agent can physically tamper with it and replace its own reward function. So there's this self-modification aspect to it.

Feedback loops, where there's a piece of the reward function that can amplify itself and drown out all other pieces of the reward function. Léon Bottou gives an example of font size and ad placement: ads that are more successful you display in larger fonts, because you think people want to see them more. But this is a self-fulfilling feedback loop, and so you have these giant ads that may not have been popular at all.

And simply complicated systems. Any system that's complicated, just like with software, is increasingly likely to break under optimization pressure. Someone showed that with tool-assisted speedruns you can execute arbitrary code just by pressing the controller buttons in Super Mario. I don't think an RL agent has ever done that, but the environment is such that they could.

So, some approaches to it. Something that's been mentioned a lot by Eric and a couple of other people: we shouldn't underestimate just verification, monitoring, isolation, and unit testing of parts. There's this idea of adversarial reward functions, where maybe the reason this hacking is happening is that the objective function is this static thing, whereas the agent is this active thing that's trying to outsmart the reward function in some sense. So is there some way (we didn't get super specific about this) that you could use generative adversarial networks or other unsupervised techniques to try to make the objective function, whatever it is, more resistant? Making your reward dependent on your model of the environment, instead of something that's given to you, is something that can help deal with partial observation, as is isolating yourself algorithmically from the parts of the environment that relate to the reward.

Okay, expensive objective functions. Here we only have one problem, and it's basically the problem that I described.
How do we ensure safe behavior when the true objective function is too expensive to evaluate frequently? A human could tell you what the right thing to do is, but you defeat the purpose of automation if you always have to ask the human. An example: a cleaning robot might often find human possessions. It can't ask every time it finds something whether it belongs to a human, and it might end up throwing out the wrong one. So it has to learn when to ask and when it should already know based on what it has asked before.

I won't go into this in too much detail, but one toy system we've started playing with, and have actually started doing experiments with, is this setup of semi-supervised or active RL. It's just like RL, except that there's a reward in the environment, which we can think of as maybe coming from human supervision, where the agent has to actively request the true objective and can only do so every once in a while. So can it do the best it can to learn to do the right thing based on that very limited information? One other thing we could study, and I'm skipping over a lot of stuff, is whether the agent will, over time, as it requests the true objective, begin to learn learnable proxies. Say, instead of a human giving a detailed evaluation, the agent gives the human a button to say yes or no to something, and gets a much quicker evaluation. So will the agent learn to develop learnable proxies that are reliable and that are ultimately tethered to the true objective? All these variations on reinforcement learning, where the supervision is weakened or altered in some way, seem like they have a fair amount of promise.

Correct objective function, problems in learning. This is a case where the agent, if it had infinite time and correct beliefs, would do the right thing, but something goes wrong in the meantime. Safe exploration is: how do we avoid catastrophic exploratory moves? We want our robot to experiment with cleaning strategies. We don't want it to stick a wet mop in an electrical socket.
We don't want helicopters to crash, as one of the previous speakers discussed. Distributional shift, which I think has also been discussed previously: how do we detect and respond gracefully to differences between the train and test distributions? Oh yeah, sorry, that example is wrong. The example I meant to give is that the robot is trained on a factory floor, where it uses a bunch of harsh chemicals, and then it works in an office, where that doesn't make sense.

Generally these have gotten a lot of attention, so those sections of our paper are more like reviews, but with pointers to how this can be applied to emerging deep learning applications. Going a little quickly through it: there's substantial prior work in this area. I think there's a lot on adapting risk-sensitive performance metrics to deep learning, simulated exploration, especially with the advent of model-based reinforcement learning in deep RL setups, and setups that use bounded exploration to ensure ergodicity.

ML systems can fail silently and unpredictably when the test distribution differs from the train distribution. This is a Roomba that ate someone's hair. With classifiers, in some sense things can't be very bad, but for people who remember this, a classifier can do something incredibly offensive if it's not trained in the right way and you give it the wrong inputs. RL agents can act on really bad perceptual info. We can add ad hoc restrictions to make sure that we detect this.
But I think ultimately we need, and should be able to have, a machine learning solution, because if you're outside the training distribution, you have enough information to detect that your inputs are different from where they were before. There's been a lot of work on this. I'm running short on time, so I'll just summarize it very quickly: covariate shift assumptions, which assume that the probability of the output given the input stays the same and only the inputs change, and the converse of that. Something I think is very interesting is broadening the train distribution and understanding, if I train on a few different types of data, how much of that it takes to generalize, and how I determine when I've generalized well. And also how to respond in agents, which I think has been a very understudied topic. There's been a lot of study of how to detect being out of distribution and how to respond to it in classifiers, but agents can do a whole lot of things, right? If I see a chemical that I've never seen before, then I as a human can experiment with that chemical in a controlled environment, even if that's more expensive, instead of doing something else, or I can try to access more training data. So I think there's a whole world of agents remedying their own uncertainty and knowing what they don't know.

We've started proposing some experiments, which I won't go through in great detail, but the thrust of a lot of these is looking for large suites of simulated environments, where we can have very broad setups and test agents in a way that they don't overfit and can actually be shown to address these tasks.
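The earlier point that an agent outside its training distribution has enough information to notice that its inputs have changed can be illustrated with a deliberately simple Python sketch. The z-score test, the `threshold` of 3.0, the sensor readings, and the fallback action names are all assumptions chosen for illustration; none of them comes from the talk or the paper, and real systems would need far richer detectors.

```python
import statistics


class ShiftDetector:
    """Flags inputs that lie far from what was seen during training."""

    def __init__(self, training_values, threshold: float = 3.0):
        # Summarize the training distribution of one scalar feature.
        self.mean = statistics.fmean(training_values)
        self.std = statistics.stdev(training_values)
        self.threshold = threshold

    def is_out_of_distribution(self, value: float) -> bool:
        # Distance from the training mean, in training standard deviations.
        z_score = abs(value - self.mean) / self.std
        return z_score > self.threshold


def choose_action(detector: ShiftDetector, reading: float) -> str:
    # Hypothetical agent response: fall back to a cautious action
    # instead of acting on an input it has never seen anything like.
    if detector.is_out_of_distribution(reading):
        return "ask_human"
    return "clean_as_trained"


# Hypothetical sensor readings collected on the factory floor.
factory_readings = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
detector = ShiftDetector(factory_readings)

print(choose_action(detector, 1.02))  # close to the training data
print(choose_action(detector, 4.0))   # far from anything seen in training
```

Detection is only half of what the talk asks for; the `ask_human` branch is a placeholder for the broader set of responses an agent could take, such as experimenting in a controlled setting or gathering more training data.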
Just as simulated environments in video games have been very important for Go and Atari and reinforcement learning in general, my hope is they can be very important for safety. In fact, within a couple of days of publishing the paper, some folks at OpenAI implemented a couple of the ideas we had in OpenAI Gym, which is one of their open source tools for RL.

Finally, since there seems to have been a tradition of everyone mentioning the long-term concerns a little bit, I did want to give my perspective, which is that I do think long-term concerns about AGI and superintelligence can be seen as extreme versions of accident risk. Not Terminator robots taking over, but the more academic writers like Nick Bostrom, who talk about how, if you make a machine that makes paper clips for you, maybe it'll make so many paper clips that it takes over the world. So this is an extreme and more fanciful version of accident risk. I think it's too early to work on this directly. I don't know what such systems would look like or what to do with them. But what I do feel strongly is that if we take a principled and general approach to the safety of today's systems, that is the correct response to concerns about the far future.

Andrew Ng, who was my former boss, had a quote: "Worrying about superintelligence today is like worrying about overpopulation on Mars." I'd like to propose a bit of a modification to that, even though I think I agree with a lot of it: it may not be worth worrying about overpopulation on Mars today, but we can and should study overpopulation on Earth. That is a serious issue we can think about, and if we do that right, then a lot of what we learn may someday apply to Mars, if and when we do go to Mars.











