Safe self-modification team: the story

This is a summary of the project of the team studying how agents may decide for or against self-modifying their incentives.

Original idea and motivation

The first question we asked was “Under what circumstances is it beneficial for an intelligent agent to modify its utility function?” We considered this important because a future advanced artificial intelligence will likely have the ability to self-modify. It could do this by directly rewriting its source code, or, if that is not possible, by taking actions in the world which lead to its modification, for example manipulating people into modifying it. If the initial goals of the AI are aligned with human values, changes to the utility function can lead to unwanted behaviour. Conversely, if we figure out under what circumstances the agent is willing to have its goals changed, this could enable us to correct the goals of a misaligned AI after deployment.

If an agent continues to pursue the same goals in the future, it is more likely that the current goals will be achieved. Therefore, under most circumstances, an agent would want to preserve its utility function. We decided to look for exceptions to this default behaviour.

At the first retreat

We tried to generate a list of situations in which an agent might want to change its utility function. We came up with the following ideas:

  • Blackmail: other agents threatening that if our agent doesn’t self-modify, they will destroy something the agent currently cares about.
  • Precommitment mechanism: an agent can prove to other agents that it will take certain actions in the future by adding a strong reward for those actions into its utility function, or using an (external) penalty.
  • Utility merging: two agents with similar goals might want to agree on a compromise utility function which they will begin to optimise from now on. This way, they will avoid wasting resources on fighting against each other and instead work towards their shared goals.
  • Bounded rationality: agents which are not entirely rational might want to modify their utility function to encourage a certain future behaviour. For example, humans are more likely to stick to a new habit if they publicly announce their plan, because having to admit to their friends that they failed is unpleasant. This is equivalent to adding an extra penalty to the utility function for failing to take certain actions in the future.
  • Saving computing power: a simpler utility function might be easier to optimise for, or might in practice be optimised in a more robust or reliable way.
  • Hedonism: an agent might (eventually) change its utility function to something which is easy to achieve and then only e.g. strive for survival/existence. How to prevent this?

Between retreats

Because many of these situations point to different topics and studying all of them at once would be too ambitious, we decided to narrow down our focus to a more specific question. 

We had some ideas about utility merging: How is the compromise utility function to be determined? What mechanism can the agents use to gradually converge to a shared utility function? Is it necessary to merge utilities to avoid fighting and wasting resources? We decided not to work on these questions because we didn’t have a concrete model for studying them and they seemed less tractable. We were also unsure about their relevance to AI safety.

Another idea was to continue the work started in this paper and see if the same results apply to different types of modification than utility function modification. This problem seemed more tractable because we could use the same model as is used in the paper and ask specific questions which can be answered with mathematical proofs.

At the second retreat

We changed our focus from just studying utility function modification to self-modification in general. We studied memory modification (what the agent believes about the history) and belief modification (the agent’s model of the world). It turned out that what Everitt et al. proved about policy modification could be extended to these other types of modification as well. We concluded that perfectly rational utility maximizers will never want to self-modify in a dangerous way. However, our results felt intuitively obvious, seemed of limited use for the real world, and the proofs were almost trivial.

After a brainstorming session about what to do next, we decided to study agents which are nearly rational. One way to define this is that, over the whole future, their policy always achieves utility within epsilon (some small number) of the maximum possible utility. Similarly, we can study agents which are perfect utility maximizers but whose utility function deviates from the utility function of humans by at most epsilon. Finally, we studied agents whose model of the world is inaccurate by at most epsilon. Questions about these agents were of the right difficulty: less trivial but still tractable.
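One way the three epsilon-notions above might be formalized is sketched below. This is our reading, not necessarily the exact definitions the team used; V denotes discounted value, u the true utility function, and mu the true environment.

```latex
% epsilon-optimal policy: at every history h, the policy \pi stays
% within epsilon of the optimal policy \pi^* in achievable value
V^{\pi}(h) \;\ge\; V^{\pi^*}(h) - \varepsilon

% epsilon-misaligned utility: the agent perfectly maximizes \tilde{u},
% which differs from the human utility u by at most epsilon everywhere
\lvert \tilde{u}(h) - u(h) \rvert \;\le\; \varepsilon \quad \text{for all } h

% epsilon-inaccurate world model: the agent's belief \tilde{\mu} assigns
% probabilities within epsilon of the true environment \mu
\lvert \tilde{\mu}(e \mid h) - \mu(e \mid h) \rvert \;\le\; \varepsilon
```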


We derived an upper bound on the utility loss due to the policy being suboptimal. We found similar bounds for perfect utility maximizers with an epsilon-inaccurate world model and for agents whose utility function is misaligned by at most epsilon.

We showed that if the agent has the ability to self-modify and future utility is discounted by a factor of gamma, this leads to potentially very bad behaviour in the long run if the agent’s policy is suboptimal. For arbitrarily small epsilon, the behaviour becomes arbitrarily bad in the distant future.
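The intuition behind this divergence can be illustrated with a toy calculation (this is an illustration, not the team's actual proof): if the agent is only required to stay within epsilon of the optimal discounted value, then a per-step reward shortfall of delta at time t costs only gamma^t * delta in discounted terms, so the tolerated undiscounted shortfall epsilon / gamma^t grows without bound as t grows.

```python
# Toy illustration (not the team's proof): with discount factor gamma,
# an epsilon slack in *discounted* value permits an *undiscounted*
# per-step reward shortfall of up to epsilon / gamma**t at time t,
# which diverges as t grows.

gamma = 0.9
epsilon = 0.01  # arbitrarily small tolerance on discounted value

def tolerated_shortfall(t: int) -> float:
    """Largest undiscounted reward shortfall at step t whose
    discounted cost still fits within epsilon."""
    return epsilon / gamma**t

for t in (0, 10, 100, 200):
    print(f"t={t}: tolerated shortfall {tolerated_shortfall(t):.4g}")
```

Even with epsilon as small as 0.01, the tolerated per-step shortfall eventually exceeds any fixed bound, matching the claim that behaviour can become arbitrarily bad in the distant future.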

The situation is different for agents which are perfect utility maximizers but optimize for the wrong utility function or work with an inaccurate world model. If the inaccuracy in the utility function or the world model is sufficiently small, the loss in utility is finite and has an upper bound. This result suggests that having an inaccurate model of the world or a slightly misaligned utility function is preferable to having an imperfect decision-making algorithm.
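The team's exact bound is not reproduced here, but a standard bound of this flavour, for a utility (reward) function perturbed by at most epsilon per step, looks as follows:

```latex
% If the agent optimizes \tilde{u} with |\tilde{u} - u| \le \varepsilon
% per step, and \tilde{\pi} is optimal for \tilde{u}, then under
% discount factor \gamma the true value satisfies
V_u^{\tilde{\pi}} \;\ge\; V_u^{\pi^*} \;-\; \frac{2\varepsilon}{1-\gamma}
```

The loss stays finite because each step's error is at most epsilon and the discounted errors form a geometric series summing to epsilon / (1 - gamma); the factor of 2 accounts for the error appearing both in evaluating the chosen policy and in comparing it against the true optimum.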