Overview of the projects selected by the participant teams for AISRP 2019.
We include short, informal project reports to capture the participant experience, but also the range of explored ideas and lessons learned. We will extend the list with additional links to future public outputs.
AI forecasting research agenda
Project report: Forecasting team
Feb 2020: Processing the performed survey, methodology research, preparing publication.
June 2020: Finishing extensive publication of survey results.
Journal publication in preparation (preprint ETA July-Aug 2020)
Forecasting transformative AI is a unique problem, and the notion of AI forecasting is poorly defined and often used very loosely. The research area is in a pre-paradigmatic state and needs a unifying structure that clarifies basic questions, such as what AI forecasting actually covers, in order to organize thinking about the topic.
This is a significant project due to the broad scope of AI forecasting. We want to start by further surveying the area and collecting the positions of stakeholders, as well as developing schemas for surveying, collecting information and drawing conclusions.
Aligning AI Ecosystems
Project report: Here.
Feb 2020: Concrete models in several domains, near-term considerations (pre-comprehensive AI services ecosystem).
Service Systems as a Paradigm for AI Alignment (shared document) and AF linkpost with more discussion.
The paradigm of AI services (including, but not necessarily limited to, ‘comprehensive’ systems) could offer a useful alternative view on AI safety problems. Moreover, the idea that services are gradually automated, and that this process might be problematic, might be more relatable to many mainstream researchers. To become useful, however, CAIS (Comprehensive AI Services) needs a list of corresponding technical problems to be addressed, and a more precise language (model) in which to express them.
We would like to find a useful mathematical model of CAIS and formulate a list of technical (safety) problems that need to be addressed to make the system aligned and beneficial. To make sure that the proposed formalization is useful, and to draw more attention to it, we would also like to make some progress on one of these technical problems.
Interpretability of deep learning systems
Project report: Team Interpretability
Feb 2020: Explaining RL and MCTS decisions (e.g. Go), AGI interpretability desiderata, overview of recent interpretability results
June 2020: Limits of interpretability of strong AI systems, steel-man versions of current methods.
What is Interpretability (AF post)
How can Interpretability help Alignment (AF post)
Sparsity and interpretability (AF post)
An empirical and conceptual investigation into interpretability for current reinforcement learning methods, motivated by how and why interpretability will be useful for long-term AI safety.
Understanding how an AI makes its decisions, and how it has learned to make them, is a useful tool for ensuring that the AI is safe without having to deploy it. Interpretability in reinforcement learning is relatively underexplored and poses a range of challenges that aren’t present in a supervised learning setting; we’re interested in investigating these.
Extreme situation synthesis (also “Heaven / Hell synthesis”). Trying to understand an RL agent by synthesizing a good sample of both ideal and worst-case environments. This approach has no ambition to fully explain the agent’s behavior, but it has the potential to reveal some aspects of the agent’s understanding of the environment and reward.
Does sparsity imply interpretability? At a high level, we’re interested in what training procedures can increase the interpretability of a deep learning system. Intuitively and anecdotally, sparsity seems like a good candidate and we want to explore this possibility.
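To make the sparsity question concrete, here is a minimal sketch of one common way to encourage sparsity during training: an L1 penalty, optimized with proximal gradient descent (ISTA) on a tiny linear model. The source does not say which training procedures the team studied, so the method, the toy data and all names (`soft_threshold`, `ista`, `lam`) are assumptions chosen for illustration only.

```python
def soft_threshold(w, lam):
    """Proximal step for lam * |w|: shrink toward zero, small values become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def ista(xs, ys, lam, step=0.1, iters=500):
    """Least squares with an L1 penalty, solved by proximal gradient descent."""
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    for _ in range(iters):
        # Gradient of (1 / 2n) * sum((x . w - y)^2) with respect to w.
        residuals = [sum(wj * xj for wj, xj in zip(w, x)) - y
                     for x, y in zip(xs, ys)]
        grad = [sum(r * x[j] for r, x in zip(residuals, xs)) / n
                for j in range(d)]
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad)]
    return w

# Hypothetical toy data: three orthogonal features; the target depends
# strongly on feature 0, very weakly on feature 1, and not at all on feature 2.
xs = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
ys = [2.05, 1.95, -1.95, -2.05]

dense = ista(xs, ys, lam=0.0)    # no penalty: the weak feature keeps a small weight
sparse = ista(xs, ys, lam=0.1)   # L1 penalty: the weak feature is dropped entirely
print(dense)
print(sparse)
```

With the penalty switched on, only the weight on feature 0 stays nonzero, so the fitted model can be read as "depends on feature 0 only". Whether this kind of readability carries over to deep networks is exactly the open question posed above.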
Newcomb-like problems in AIXI and Quasi Bayesian Agents
Project report: Team Newcomb
Feb 2020: Expanding several posts on decision theory, agent embeddedness and quasi-Bayesian agents.
Reference Post: Trivial Decision Problem (AF post)
Stuck Exploration (AF post)
Embedded vs. External Decision Problems (AF post)
What makes counterfactuals comparable? (AF post)
Vulnerabilities in CDT and TI-unaware agents (AF post)
Investigating how agents that learn the environment, such as AIXI, Quasi Bayesian Agents, and possible modified versions of them, would behave in Newcomb-like situations.
The main question is: How do our present formalisms of agents that learn the environment behave in Newcomb-like situations?
One-shot Newcomb problems. We don’t see one-shot as substantially different from multi-shot. In the one-shot problem as it is typically described, the agent knows that other agents have encountered Newcomb’s problem, so it is very similar to the multi-shot case (though there are some complexities there). Simply producing an appropriate formalisation would be worthy of a blog post.
Quasi Bayesian Agents. We suspect that similar work can be done within this formalism of agents that can deal with Knightian uncertainty.
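As a starting point for thinking about the Newcomb-like situations discussed above, the following sketch computes the expected payoff of one-boxing and two-boxing when the predictor is correct with a given probability. It is only the textbook calculation that conditions on the action taken, not the team's formalisation; the payoff amounts are the conventional ones from the thought experiment, and the function names are hypothetical.

```python
def expected_payoffs(accuracy, small=1_000, big=1_000_000):
    """Expected payoff of each action when the predictor is right with
    probability `accuracy`, conditioning on the action actually taken."""
    one_box = accuracy * big                 # opaque box is full if one-boxing was predicted
    two_box = small + (1 - accuracy) * big   # opaque box is full only if the predictor erred
    return one_box, two_box

def break_even_accuracy(small=1_000, big=1_000_000):
    """Predictor accuracy at which both actions have equal expected payoff."""
    return (big + small) / (2 * big)

one_box, two_box = expected_payoffs(0.99)
print(one_box, two_box)
print(break_even_accuracy())
```

Under these assumptions one-boxing has the higher expected payoff for any predictor that is slightly better than chance. An agent that instead treats the box contents as fixed regardless of its action would compare the two options differently, which is where the behavior of learning agents such as AIXI becomes an interesting question.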
Rationality of agent self-modification
Project report: Team Safe modification
Feb 2020: Writing and expanding results on ε-optimally optimizing agents with almost perfect beliefs about their world model and/or utility.
June 2020: Finishing paper for publication.
Paper in active preparation (ETA July-Aug 2020)
When are agents incentivized to self-modify and in what ways?
Behaving like a fully rational agent requires infinite computing power and a perfect model of the world, which is impossible to achieve. A machine intelligence, however powerful, is essentially limited and therefore irrational in some way (probably very differently to humans). In most situations, a rational agent would want to preserve its utility function, because if its future self is pursuing the same goals, they are more likely to be achieved.
Self-modification and wireheading in simple models. Tom Everitt et al. have described several classes of agent constructions and proved results about their behavior with respect to self-modification of their utility and policy, in particular that the realistic agent does not modify its utility. We want to expand on this, exploring wider scenarios as well as memory and world-model modifications, possibly combined with bounded rationality.
Value synthesis between agents through self-modification. Exploring situations where it may be advantageous for two agents to agree on common values, e.g. to avoid the cost of conflict and mistrust, and asking when and how this is achievable.
Self-modification of bounded-rational agents. While fully rational agents have been shown to want to preserve their values, we have examples of cases where bounded-rational or otherwise limited agents (e.g. subject to noisy value drift or having limited access to the full utility function) have an incentive to self-modify even if they are of the “realistic” type (in Everitt’s formalism).
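The value-preservation argument above can be illustrated with a toy model. This is a deliberately simplified sketch, not Everitt et al.'s formalism: the agent's future self picks the outcome that is best under whatever utility function it then holds, while the present agent scores that outcome with its current utility function. The outcomes, utilities and function names are all hypothetical.

```python
def best_outcome(utility, outcomes):
    """Outcome the future self will choose, given the utility function it holds."""
    return max(outcomes, key=utility)

def value_of_keeping(u_current, outcomes):
    """Current-utility value of the future if the utility function is preserved."""
    return u_current(best_outcome(u_current, outcomes))

def value_of_modifying(u_current, u_new, outcomes):
    """Current-utility value of the future if the agent switches to u_new."""
    return u_current(best_outcome(u_new, outcomes))

# Hypothetical outcomes and utilities, for illustration only.
outcomes = ["a", "b", "c"]
u_current = {"a": 3, "b": 1, "c": 0}.get
u_new = {"a": 0, "b": 2, "c": 5}.get

print(value_of_keeping(u_current, outcomes))
print(value_of_modifying(u_current, u_new, outcomes))
```

In this idealized setting, modifying can never score strictly higher than keeping, because keeping already selects the outcome that maximizes the current utility. The cases of interest in this project are the ones where the idealization fails, for example when the agent cannot optimize perfectly or its values drift.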
Safer ML paradigms
Project report: Safer ML paradigms and ILP
Feb 2020: Theoretical feasibility and safety properties of strong logic-based AI systems, experimental work on Inductive Logic Programming.
June 2020: Ongoing work on a summary paper, Safety Properties of ILP.
Links to public outputs: WIP
Are there ML paradigms with better safety properties that can substitute or complement DL?
Deep learning has sustained a large increase in capabilities without much progress in its interpretability or verifiability. Since it seems sensible to assign some probability to a “prosaic AGI” scenario, in which DL systems scale to dangerous levels, this is concerning.