Interpret Ability team: the story – AI Safety Research Program

Team Interpret Ability is interested in what interpretability for (a potentially strong) artificial intelligence is at a high level: what it means to interpret a model or a reinforcement learning (RL) agent, and how we can go about creating such an interpretation.

Our team formed at the initial research retreat in November around shared interest in designing methods to enable interpretation of reinforcement learning agents, understanding what links there are between sparsity and interpretability, and thinking about how interpretability relates to AI safety and alignment.

We first created 5 initial research proposals in the area:

Investigating the relation between sparsity and interpretability (also inspired by the interpretability of symbolic reasoning).
Understanding desiderata for interpretability for AI safety and how interpretability methods would scale to highly complex systems.
Explaining the decisions of RL agents using previous experiences, for example thinking about explaining AlphaGo moves.
Incentivizing transparent agents to be interpretable or obfuscated to each other by constructing game environments, inspired by open-source game theory (e.g. if your inner thinking appears very complex, it may be harder to trust you).
Synthesising good and bad world states for RL agents (“Heaven/Hell synthesis”) to gain information about their goals and utility.

After gathering feedback from other participants, and with our own thoughts, we decided to focus on the questions of sparsity, explaining decisions and desiderata for interpretability. In the weeks between the first and second retreat, we worked on a variety of machine learning experiments and ideas.

We produced flow-graphs from pruned (and hence sparse) MNIST classifiers to gain some intuition about the links between sparsity and interpretability.
We also created a method to measure the importance of individual training samples for a model's prediction on a test sample, by comparing the gradients of the loss on the two samples.
We also had some thoughts and developed a few ideas on the desiderata for interpretability for AI safety, but this work was very exploratory, and more for our own deconfusion than for others.
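
The gradient comparison described above can be illustrated with a minimal sketch. This is not our actual implementation: it assumes a logistic regression model and uses cosine similarity as the comparison, and the function names `per_sample_grad` and `gradient_similarity` are hypothetical, chosen only for this example.

```python
import numpy as np

def per_sample_grad(w, x, y):
    """Gradient of the logistic loss on a single (x, y) pair w.r.t. the weights w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))  # predicted probability of class 1
    return (p - y) * x

def gradient_similarity(w, X_train, y_train, x_test, y_test):
    """Cosine similarity between each training-sample gradient and the test gradient.

    Assumption: cosine similarity is only one possible way to "compare" gradients;
    an unnormalised dot product would be another.
    """
    g_test = per_sample_grad(w, x_test, y_test)
    g_train = np.stack([per_sample_grad(w, x, y) for x, y in zip(X_train, y_train)])
    norms = np.linalg.norm(g_train, axis=1) * np.linalg.norm(g_test)
    return (g_train @ g_test) / np.maximum(norms, 1e-12)  # guard against zero gradients

# Toy data: three training samples and one test sample.
w = np.array([0.5, -0.25])
X_train = np.array([[1.0, 2.0], [-1.0, -2.0], [2.0, -1.0]])
y_train = np.array([1.0, 1.0, 0.0])
x_test, y_test = np.array([2.0, 4.0]), 1.0

scores = gradient_similarity(w, X_train, y_train, x_test, y_test)
```

A score near 1 means a gradient step on that training sample would also reduce the loss on the test sample, a score near -1 means it would increase it, and a score near 0 means the two samples pull the parameters in unrelated directions.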

During this time we also came up with a new idea: trying to interpret the differences between two agents may be easier than interpreting the agents directly, and still useful in e.g. expert iteration or iterated distillation and amplification (IDA). If the difference is easier to understand or predict than the two agents individually, then we can gain more insight into the agents. In particular, this is interesting if one of the agents is human. It even seems that for some games like Go, there is enough human-generated data to make a meaningful comparison.

Leading up to the second retreat we narrowed our focus to two ideas: interpreting the differences between agents, and explaining decisions in reinforcement learning. We didn’t manage to settle on one idea before arriving at the retreat, so we decided to spend the first portion of the retreat deciding between them. At the retreat, we started by thinking hard about how to define interpretability rigorously. Having a (semi-)formal framing and definition, and a collection of goals or attributes of different methods, helped us discuss research clearly during the rest of the retreat, and motivated our idea generation.

After this, we worked on three different technical ideas or experiments over the course of the retreat. The first was on the sparsity and interpretability project, where we thought about applying state-of-the-art interpretability methods to sparsified networks and seeing the results. This yielded an idea for a concrete experiment which we hope to pursue after the retreat. The second was on explaining decisions in RL, and focused on understanding the similarities and differences between the gradient similarity method we had created and a method from the literature called Influence Functions. We discovered they were more similar than we thought, and this helped us understand both of them better. Finally, we thought about how to understand and interpret the metric space of activations of a neural network layer by layer, to see how each layer morphs and changes the input space into something useful for its task. We produced some visualisations of this space, and plan to continue working on this method, hoping to gain insight into the differences between different network architectures and datasets using it.
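
One way to see the kinship between the two methods is to write them side by side. The notation below is an assumption for illustration: it follows the standard formulation of influence functions from the literature and an unnormalised dot-product variant of gradient similarity, and it is not necessarily the exact comparison we worked through at the retreat.

```latex
% Influence of up-weighting a training point z on the loss at a test point z_test,
% evaluated at the trained parameters \hat\theta:
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
      H_{\hat\theta}^{-1}\,
      \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta)

% Gradient similarity (unnormalised form):
S(z, z_{\text{test}})
  = \nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
    \nabla_\theta L(z, \hat\theta)

% Replacing the Hessian by the identity recovers gradient similarity up to sign:
H_{\hat\theta} = I \;\Longrightarrow\; \mathcal{I}(z, z_{\text{test}}) = -\,S(z, z_{\text{test}})
```

Under this reading, gradient similarity behaves like an influence function that ignores the curvature of the loss, which makes it cheaper to compute but blind to how the parameters interact.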

Our plan for after the retreat is threefold:

Most generally, we’re working on an internal interpretability research agenda, to capture all our ideas and possible research directions. This will ensure we don’t forget promising directions, and are able to do as much fruitful research as possible.

Secondly, we plan to write two blog posts on work that we’ve mostly completed: The generation of flow graphs from sparse networks we did before the retreat, and the framing of interpretability we built at the beginning. This second blog post will also involve thinking about how this framing relates to alignment.

Finally, we plan to continue working on the visualisations of the network’s internal activations, as well as the experiments on the relation between sparsity and interpretability, and on explaining decisions of RL agents by interpreting the mechanism by which they make their decisions.

Overall, the research program has been a great opportunity to generate new research ideas, more than we’ve even been able to describe in this text.