AISRP 2019 projects and results

Overview of the projects selected by the participant teams for AISRP 2019.

We include short, informal project reports to capture not only the participant experience but also the range of explored ideas and lessons learned. We will extend the list with links to further public outputs as they appear. Last updated: August 2020.

AI forecasting research agenda

Informal project report: Forecasting team
Published outputs:
Forecasting AI Progress: A Research Agenda (arXiv, submitted to journal, AF post)

Original summary: Forecasting transformative AI is a unique problem: the notion of AI forecasting is poorly defined and often used very loosely. The research area is in a pre-paradigmatic state and needs a unifying structure that clarifies such questions and organizes thinking about the topic.

Given the broad scope of AI forecasting, this is a substantial project. We want to start by further surveying the area and collecting the positions of stakeholders, as well as by developing schemas for surveying, information collection, and drawing conclusions.

Aligning AI Ecosystems

Informal project report: AI ecosystems team
Published outputs:
Service Systems as a Paradigm for AI Alignment (shared document) and AF linkpost with more discussion.
Ongoing work: Argument for studying AI ecosystems (document), technical problems and emergent properties in AI systems.

Original summary: The paradigm of AI services (including, but not necessarily limited to, ‘comprehensive’ systems) could offer a useful alternative view on AI safety problems. Moreover, the idea of gradual automation of services, and of the problems this process might bring, may be more relatable to many mainstream researchers. However, to become useful, CAIS needs a list of corresponding technical problems to be addressed, and a more precise language (model) in which to express them.

We would like to find a useful mathematical model of CAIS and formulate a list of technical (safety) problems that need to be addressed to make the system aligned and beneficial. To make sure that the proposed formalization is useful, and to draw more attention to it, we would also like to make some progress on one of these technical problems.

Interpretability of deep learning systems

Informal project report: Team Interpretability
Published outputs:
What is Interpretability (AF post)
How can Interpretability help Alignment (AF post)
Sparsity and interpretability (AF post)
Ongoing work: reassessing the impact of technical work on contemporary interpretability methods, steel-manned variants and inherent limits of those methods. Technical: interpreting RL systems and games (AlphaGo/Leela).

Original summary: An empirical and conceptual investigation into interpretability for current reinforcement learning methods, motivated by how and why interpretability will be useful for long term AI safety.

Understanding how an AI makes its decisions, and how it has learned to make them, is a useful tool for ensuring that the AI is safe without having to deploy it. Interpretability in reinforcement learning is relatively underexplored and poses a range of challenges, not present in the supervised learning setting, that we are interested in investigating.

Read more
Explaining decisions. Finding which data points in the training environment gave a policy evidence that the decision it made is the right one. One framing for this is “debugging of AI training” – when we find a problematic input+output situation, we can see why it was learned.
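
As a rough illustration of this framing (our own toy sketch, not the team's method), the snippet below scores training transitions by how well their policy-gradient update aligns with the gradient of the questioned decision; the highest-scoring transitions are the ones that pushed the policy towards that decision. The network, data and scoring rule are all assumptions.

    # Toy sketch: rank training transitions by how much their policy-gradient
    # update pushed the policy towards a questioned decision.
    # The network, data and scoring rule are illustrative assumptions.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 2))

    def logprob_grad(state, action):
        """Gradient of log pi(action | state) w.r.t. policy parameters, flattened."""
        logp = torch.log_softmax(policy(state), dim=-1)[action]
        grads = torch.autograd.grad(logp, tuple(policy.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    # Questioned decision: why did the policy pick action 1 in this state?
    query_state, query_action = torch.randn(4), 1
    g_query = logprob_grad(query_state, query_action)

    # Stand-in training buffer of (state, action, advantage) transitions.
    buffer = [(torch.randn(4), int(torch.randint(2, ())), float(torch.randn(())))
              for _ in range(50)]

    # Influence score: advantage-weighted alignment with the query gradient.
    scores = [adv * torch.dot(logprob_grad(s, a), g_query).item()
              for s, a, adv in buffer]
    top = sorted(range(len(buffer)), key=lambda i: scores[i], reverse=True)[:5]
    print("transitions most responsible for the decision:", top)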

Extreme situation synthesis. Also “Heaven / Hell synthesis”. Trying to understand an RL agent by synthesizing a good sample of both ideal and worst environments. This approach has no ambition to fully explain the agent’s behavior, but it has the potential to reveal some aspects of the agent’s understanding of the environment and reward.
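
A minimal sketch of how such synthesis could work, assuming a differentiable value function and a gradient-based search over observations (the critic here is a random stand-in; a real study would use a trained agent):

    # Toy "Heaven / Hell synthesis": gradient-optimise an observation to maximise
    # (heaven) or minimise (hell) the agent's value estimate V(s).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    critic = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # stand-in V(s)

    def synthesize(sign, steps=200, lr=0.1):
        obs = torch.zeros(8, requires_grad=True)
        opt = torch.optim.Adam([obs], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            (-sign * critic(obs)).backward()    # ascend for heaven, descend for hell
            opt.step()
            with torch.no_grad():
                obs.clamp_(-1.0, 1.0)           # keep the observation in a valid range
        return obs.detach()

    heaven, hell = synthesize(+1), synthesize(-1)
    print("V(heaven) =", critic(heaven).item(), " V(hell) =", critic(hell).item())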

Does sparsity imply interpretability? At a high level, we’re interested in what training procedures can increase the interpretability of a deep learning system. Intuitively and anecdotally, sparsity seems like a good candidate and we want to explore this possibility.
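
As a concrete starting point (a minimal sketch under assumed data and hyperparameters, not a result), one can add an L1 penalty during training and then measure how many weights end up near zero:

    # Minimal sketch: train with an L1 penalty to push weights towards zero, then
    # measure the resulting sparsity. Data and coefficients are made up.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    x, y = torch.randn(256, 10), torch.randn(256, 1)
    l1_coeff = 1e-3  # strength of the sparsity pressure (assumed value)

    for _ in range(500):
        opt.zero_grad()
        mse = nn.functional.mse_loss(net(x), y)
        l1 = sum(p.abs().sum() for p in net.parameters())
        (mse + l1_coeff * l1).backward()
        opt.step()

    weights = torch.cat([p.detach().reshape(-1) for p in net.parameters()])
    print("fraction of near-zero parameters:", (weights.abs() < 1e-3).float().mean().item())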

Newcomb-like problems in AIXI and Quasi Bayesian Agents

Informal project report: Team Newcomb
Published outputs:
Reference Post: Trivial Decision Problem (AF post)
Stuck Exploration (AF post)
Embedded vs. External Decision Problems (AF post)
What makes counterfactuals comparable? (AF post)
Vulnerabilities in CDT and TI-unaware agents (AF post)

Original summary: Investigating how agents that learn the environment, such as AIXI, Quasi Bayesian Agents, and possible modified versions of them, would behave in Newcomb-like situations.

The main question is: How do our present formalisms of agents that learn the environment behave in Newcomb-like situations?

Read more
Iterated Newcomb problems. Formalize plausible environments for each Newcomb-like problem, pit them against AIXI, and see which environment in Solomonoff’s prior it converges to. If the result depends on the choice of universal Turing machine, we want to explore exactly what causes the dependence, or try removing traps from AIXI’s prior.
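
For illustration, here is one plausible toy formalisation (ours, with assumed details, and a simple average-reward learner in place of AIXI): the predictor fills the opaque box with probability equal to the agent's one-boxing frequency so far, and which action the learner ends up preferring depends on how its return estimates track that shifting probability, which is exactly the kind of convergence question we want to answer for AIXI under Solomonoff's prior.

    # Toy iterated Newcomb environment (an illustrative formalisation, not the
    # project's): the predictor's accuracy is the agent's past one-boxing frequency.
    import random

    random.seed(0)

    class IteratedNewcomb:
        def __init__(self):
            self.one_box_count, self.rounds = 0, 0

        def step(self, action):                     # action: 1 = one-box, 0 = two-box
            p_big = self.one_box_count / self.rounds if self.rounds else 0.5
            big = 1_000_000 if random.random() < p_big else 0   # opaque box contents
            reward = big if action == 1 else big + 1_000        # two-boxing adds $1,000
            self.one_box_count += action
            self.rounds += 1
            return reward

    env = IteratedNewcomb()
    value, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for t in range(20_000):
        a = random.choice([0, 1]) if random.random() < 0.1 else max(value, key=value.get)
        r = env.step(a)
        counts[a] += 1
        value[a] += (r - value[a]) / counts[a]      # incremental average return
    print("estimated returns:", {"two-box": round(value[0]), "one-box": round(value[1])})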

One-shot Newcomb problems. We don’t see the one-shot case as substantially different from the multi-shot one. In the one-shot problem as it is typically described, the agent knows that other agents have encountered Newcomb’s problem, so it is very similar to a multi-shot setting (with some added complexities). Simply producing an appropriate formalisation would be worthy of a blog post.

Quasi Bayesian Agents. We suspect that similar work can be done within this formalism of agents that can deal with Knightian uncertainty.

Rationality of agent self-modification

Informal project report: Safe self-modification team
Published outputs:
Performance of Bounded-Rational Agents With the Ability to Self-Modify (arXiv, accepted to AAAI 2021 workshop SafeAI 2021)

Original summary: When are agents incentivized to self-modify and in what ways?

Behaving like a fully rational agent requires infinite computing power and a perfect model of the world, which is impossible to achieve. A machine intelligence, however powerful, is therefore essentially limited and irrational in some way (though probably very differently from humans). In most situations, a rational agent would want to preserve its utility function, because if its future self pursues the same goals, those goals are more likely to be achieved.

Read more
However, in some cases modifying one’s utility function might prove instrumental and make it more likely that current goals are achieved. Such situations include value synthesis (two agents with similar utility functions merging into a ‘compromise’ utility function so that they can cooperate better), selecting a more robust value representation, or open-source game-theoretic situations (avoiding blackmail, precommitment in non-rational agents to avoid akrasia, etc.).

Self-modification and wireheading in simple models. Tom Everitt et al. have presented several classes of agent constructions and proved results about their behavior with respect to self-modification of their utility and policy, in particular that the realistic agent does not modify its utility. We want to expand on this, exploring wider scenarios as well as memory and world-model modifications, possibly combined with bounded rationality.
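
To make the distinction concrete, here is a toy illustration (our own, with made-up numbers) of the two agent types: the agent may replace its utility with one that is trivially maximised, and only the hedonistic agent, which evaluates the future with the utility it will then have, takes the modification.

    # Toy contrast between "hedonistic" and "realistic" agents facing a utility
    # self-modification. Numbers are assumed for illustration.
    U_TASK_IF_WORK = 0.8      # current utility achieved by keeping u_task and working
    U_EASY_ALWAYS = 1.0       # the modified utility u_easy is maximal whatever happens
    U_TASK_IF_MODIFIED = 0.0  # by current utility, a modified agent achieves nothing

    def hedonistic_value(modify):
        # Evaluates the future with the utility the agent will have at that time.
        return U_EASY_ALWAYS if modify else U_TASK_IF_WORK

    def realistic_value(modify):
        # Evaluates the future with the agent's current utility function.
        return U_TASK_IF_MODIFIED if modify else U_TASK_IF_WORK

    for name, value in [("hedonistic", hedonistic_value), ("realistic", realistic_value)]:
        print(f"{name} agent self-modifies:", max([True, False], key=value))
    # hedonistic agent self-modifies: True; realistic agent self-modifies: False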

Value synthesis between agents through self-modification. Exploring situations where it may be advantageous for two agents to agree on common values, e.g. to avoid the cost of conflict and mistrust, and asking when and how this is achievable.
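
A worked toy example (assumed numbers) of the trade-off: merging onto a compromise utility is advantageous, by each agent's current values, exactly when the value lost to compromise is smaller than the expected cost of conflict and mistrust.

    # Assumed payoffs for two agents deciding whether to merge onto a compromise
    # utility function or keep their own and bear the cost of conflict.
    CONFLICT_COST = 0.3    # expected loss to each agent from conflict and mistrust
    VALUE_OVERLAP = 0.85   # how well the compromise utility serves each agent's goals

    value_if_separate = 1.0 - CONFLICT_COST   # pursue own goals, pay the conflict cost
    value_if_merged = VALUE_OVERLAP           # compromise goals, cooperate fully

    print("stay separate:", value_if_separate)   # 0.7
    print("merge values: ", value_if_merged)     # 0.85
    print("merging is advantageous:", value_if_merged > value_if_separate)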

Self-modification of bounded-rational agents. While fully rational agents have been shown to want to preserve their values, we have examples of cases where bounded-rational or otherwise limited agents (e.g. subject to noisy value drift or having limited access to the full utility function) have an incentive to self-modify even if they are of the “realistic” type (in Everitt’s formalism).
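
One such case can be sketched with a small Monte Carlo simulation (our assumed setup): an agent whose value parameter drifts with noise each step compares the expected utility, judged by its current values, of paying a small cost to lock its values in against letting the drift continue.

    # Monte Carlo sketch: noisy value drift gives a "realistic" agent an incentive
    # to self-modify (lock in its values). All parameters are assumed.
    import random

    random.seed(0)
    HORIZON, DRIFT, LOCK_IN_COST, TRIALS = 20, 0.05, 0.02, 5_000

    def achieved_utility(lock_in):
        target = 0.0        # the agent's current (true) value parameter
        pursued = 0.0       # the parameter the agent actually ends up optimising
        total = 0.0
        for _ in range(HORIZON):
            if not lock_in:
                pursued += random.gauss(0.0, DRIFT)   # unchecked value drift
            total += 1.0 - abs(pursued - target)      # payoff drops as values diverge
        return total - (LOCK_IN_COST if lock_in else 0.0)

    for lock_in in (True, False):
        mean = sum(achieved_utility(lock_in) for _ in range(TRIALS)) / TRIALS
        print(f"lock in values: {lock_in}, expected utility by current values: {mean:.2f}")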

Safer ML paradigms

Informal project report: Safer ML paradigms and ILP
Published outputs:
Safety Properties of Inductive Logic Programming (accepted to AAAI 2021 workshop SafeAI 2021)

Original summary: Are there ML paradigms with better safety properties that can substitute or complement DL?

Deep learning has sustained a large increase in capabilities without much progress in its interpretability or verifiability. Since it seems sensible to assign some probability to a “prosaic AGI” scenario, in which DL systems scale to dangerous levels, this is concerning.

Read more
One less explored paradigm that we want to investigate for its potential capabilities and contribution to AI safety is Inductive Logic Programming (ILP), a descendant of the McCarthy logic programme. ILP can learn programs from a small number of examples using backchaining, and results in (in principle) human-readable and human-fixable models. Its time complexity and intolerance of ambiguity have left it with few practical applications, and without many theorists. Several very recent promising developments claim dramatic improvements over classical ILP systems, e.g. higher-order programs, probabilistic ILP, and hybridising DL and ILP.
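
To give a flavour of the paradigm, the brute-force sketch below (our own illustration, not a real ILP system) learns the grandparent relation from a few parent facts and labelled examples by searching over short clause bodies; the output is a human-readable logic rule.

    # Brute-force toy in the ILP spirit: find a clause body over parent/2 that
    # covers all positive and no negative examples of grandparent/2.
    from itertools import product

    parent = {("alice", "bob"), ("bob", "carol"), ("bob", "dave"), ("eve", "frank")}
    positives = {("alice", "carol"), ("alice", "dave")}
    negatives = {("bob", "carol"), ("carol", "alice"), ("eve", "bob")}
    people = {p for pair in parent for p in pair}

    def holds(body, x, y):
        """Does the rule 'grandparent(X,Y) :- body' derive grandparent(x, y)?"""
        (a1, b1), (a2, b2) = body
        def bind(var, z):
            return {"X": x, "Y": y, "Z": z}[var]
        return any((bind(a1, z), bind(b1, z)) in parent and
                   (bind(a2, z), bind(b2, z)) in parent for z in people)

    # Candidate bodies: two parent/2 literals over the variables X, Y and a fresh Z.
    candidates = [((a1, b1), (a2, b2))
                  for a1, b1, a2, b2 in product("XYZ", repeat=4)]

    for body in candidates:
        if (all(holds(body, x, y) for x, y in positives)
                and not any(holds(body, x, y) for x, y in negatives)):
            (a1, b1), (a2, b2) = body
            print(f"learned rule: grandparent(X,Y) :- parent({a1},{b1}), parent({a2},{b2})")
            break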