Overview of the projects selected by the participant teams for AISRP 2019.
We include short, informal project reports to capture both the participant experience and the range of ideas explored and lessons learned. We will extend the list with additional links to future public outputs. Last updated: August 2020.
AI forecasting research agenda
Original summary: Forecasting transformative AI is a unique problem, and the notion of AI forecasting is poorly defined and often used very loosely. The research area is in a pre-paradigmatic state and needs a unifying structure that clarifies questions such as how AI forecasting should be defined, in order to organize thinking about the topic.
This is a significant project due to the broad scope of AI forecasting. We want to start by further surveying the area and collecting the positions of stakeholders, as well as developing schemas for surveying, collecting information and drawing conclusions.
Aligning AI Ecosystems
Informal project report: AI ecosystems team
Service Systems as a Paradigm for AI Alignment (shared document) and AF linkpost with more discussion.
Ongoing work: Argument for studying AI ecosystems (document), technical problems and emergent properties in AI systems.
Original summary: The paradigm of AI services (including, but not necessarily limited to, ‘comprehensive’ systems) could offer a useful alternative view on AI safety problems. Moreover, the idea of gradual automation of services, and that this process might be problematic, might be more relatable to many mainstream researchers. To become useful, however, CAIS (Comprehensive AI Services) needs a list of corresponding technical problems that need to be addressed, and a more precise language (model) in which to express them.
We would like to find a useful mathematical model of CAIS and formulate a list of technical (safety) problems that need to be addressed to make the system aligned and beneficial. To make sure that the proposed formalization is useful, and to draw more attention to it, we would also like to make some progress on one of these technical problems.
Interpretability of deep learning systems
Informal project report: Team Interpretability
What is Interpretability (AF post)
How can Interpretability help Alignment (AF post)
Sparsity and interpretability (AF post)
Ongoing work: Reassessing impact of technical work on contemporary interpretability methods, steel-man variants and inherent limits of contemporary methods. Technical: interpreting RL systems and games (AlphaGo/Leela).
Original summary: An empirical and conceptual investigation into interpretability for current reinforcement learning methods, motivated by how and why interpretability will be useful for long term AI safety.
Understanding how an AI makes its decisions, and how it has learned to make them, is a useful tool for ensuring that the AI is safe without having to deploy it. Interpretability in reinforcement learning is relatively underexplored and poses a range of challenges that aren’t present in a supervised learning setting; we’re interested in investigating these.
Extreme situation synthesis (also “Heaven / Hell synthesis”). Trying to understand an RL agent by synthesizing a good sample of both its ideal and its worst environments. This approach has no ambition to fully explain the agent’s behavior, but it has the potential to reveal some aspects of the agent’s understanding of the environment and reward.
Does sparsity imply interpretability? At a high level, we’re interested in what training procedures can increase the interpretability of a deep learning system. Intuitively and anecdotally, sparsity seems like a good candidate and we want to explore this possibility.
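As a concrete picture of what “sparsity” can mean here, the sketch below applies soft-thresholding (the proximal step associated with an L1 penalty) to a handful of weights and measures the fraction that become exactly zero. This is only an illustration under our own assumptions: the summary does not say which sparsity-inducing procedure the team used, and the function names, the example weights and the threshold value are all hypothetical.

```python
def soft_threshold(w, lam):
    """Shrink a weight towards zero by lam; weights with |w| <= lam become exactly 0.

    This is the proximal operator of the penalty lam * |w| (an assumed,
    illustrative choice of sparsity-inducing procedure).
    """
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


if __name__ == "__main__":
    # Hypothetical weights of a single unit; not taken from any real model.
    weights = [0.8, -0.05, 0.02, -1.3, 0.1]
    pruned = [soft_threshold(w, 0.1) for w in weights]
    print(pruned)
    print("sparsity before:", sparsity(weights))
    print("sparsity after: ", sparsity(pruned))
```

With fewer non-zero weights there are fewer input-to-unit connections to inspect, which is the intuition behind the question; whether that actually makes the unit easier to interpret is exactly what the project set out to explore.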
Newcomb-like problems in AIXI and Quasi Bayesian Agents
Informal project report: Team Newcomb
Reference Post: Trivial Decision Problem (AF post)
Stuck Exploration (AF post)
Embedded vs. External Decision Problems (AF post)
What makes counterfactuals comparable? (AF post)
Vulnerabilities in CDT and TI-unaware agents (AF post)
Original summary: Investigating how agents that learn the environment, such as AIXI, Quasi Bayesian Agents, and possible modified versions of them, would behave in Newcomb-like situations.
The main question is: How do our present formalisms of agents that learn the environment behave in Newcomb-like situations?
One-shot Newcomb problems. We don’t see the one-shot case as substantially different from the multi-shot case. In the one-shot problem as it is typically described, the agent knows that other agents have encountered Newcomb’s problem, so it is very similar to the multi-shot case (with some added complexities). Simply producing an appropriate formalisation would be worthy of a blog post.
Quasi Bayesian Agents. We suspect that similar work can be done within this formalism of agents that can deal with Knightian uncertainty.
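For readers unfamiliar with Newcomb’s problem, the following minimal sketch shows the textbook payoff calculation that makes it contentious. It is not the team’s formalisation and does not model AIXI or Quasi Bayesian Agents. The payoff amounts (1,000 and 1,000,000), the predictor accuracy and all function names are illustrative assumptions.

```python
def evidential_payoffs(accuracy, small=1_000, big=1_000_000):
    """Expected payoff per action when the prediction is treated as correlated
    with the action: the predictor is right with probability `accuracy`."""
    return {
        "one_box": accuracy * big,                # box is full iff one-boxing was predicted
        "two_box": small + (1 - accuracy) * big,  # box is full only if the predictor erred
    }


def causal_payoffs(p_full, small=1_000, big=1_000_000):
    """Expected payoff per action when the opaque box is full with a fixed
    probability `p_full` that the action cannot influence."""
    return {
        "one_box": p_full * big,
        "two_box": p_full * big + small,          # always exactly `small` better
    }


def break_even_accuracy(small=1_000, big=1_000_000):
    """Accuracy at which both actions have equal evidential payoff.

    Solves accuracy * big = small + (1 - accuracy) * big.
    """
    return (big + small) / (2 * big)


if __name__ == "__main__":
    print(evidential_payoffs(0.99))
    print(causal_payoffs(0.5))
    print(break_even_accuracy())
```

Under the first calculation one-boxing wins whenever the predictor is even slightly better than chance (above the break-even accuracy), while under the second two-boxing dominates for every fixed box content. A learning agent’s behavior depends on which of these pictures its environment model ends up encoding.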
Rationality of agent self-modification
Informal project report: Safe self-modification team
Ongoing work: Finishing paper “Performance of Bounded-Rational Agents With the Ability to Self-Modify” (editing phase) on exponential degradation of ε-optimally optimizing agents with the option to self-modify.
Original summary: When are agents incentivized to self-modify and in what ways?
Behaving like a fully rational agent requires infinite computing power and a perfect model of the world, which is impossible to achieve. A machine intelligence, however powerful, is essentially limited and therefore irrational in some way (probably very differently from humans). In most situations, a rational agent would want to preserve its utility function, because if its future self pursues the same goals, they are more likely to be achieved.
Self-modification and wireheading in simple models. Tom Everitt et al. have described several classes of agent constructions and proved results about their behavior with respect to self-modification of their utility and policy, in particular that the realistic agent does not modify its utility. We want to expand on this, exploring wider scenarios and memory and world-model modifications, possibly combined with bounded rationality.
Value synthesis between agents through self-modification. Exploring situations where it may be advantageous for two agents to agree on common values, e.g. to avoid the cost of conflict and mistrust, and asking when and how this is achievable.
Self-modification of bounded-rational agents. While fully rational agents have been shown to want to preserve their values, we have examples of cases where bounded-rational or otherwise limited agents (e.g. subject to noisy value drift or having limited access to the full utility function) have an incentive to self-modify even if they are of the “realistic” type (in Everitt’s formalism).
Safer ML paradigms
Informal project report: Safer ML paradigms and ILP
Ongoing work: Draft of paper “Safety Properties of Inductive Logic Programming”, analyzing applications of ILP in the context of AI alignment.
Original summary: Are there ML paradigms with better safety properties that can substitute or complement DL?
Deep learning has sustained a large increase in capabilities without much progress in its interpretability or verifiability. Since it seems sensible to assign some probability to a “prosaic AGI” scenario, in which DL systems scale to dangerous levels, this is concerning.