AGI Agent Safety by Iteratively Improving the Utility Function
- URL: http://arxiv.org/abs/2007.05411v1
- Date: Fri, 10 Jul 2020 14:30:56 GMT
- Title: AGI Agent Safety by Iteratively Improving the Utility Function
- Authors: Koen Holtman
- Abstract summary: We present an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function.
We show ongoing work in mapping it to a Causal Influence Diagram (CID).
We then present the design of a learning agent, a design that wraps the safety layer around either a known machine learning system, or a potential future AGI-level learning system.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While it is still unclear if agents with Artificial General Intelligence
(AGI) could ever be built, we can already use mathematical models to
investigate potential safety systems for these agents. We present an AGI safety
layer that creates a special dedicated input terminal to support the iterative
improvement of an AGI agent's utility function. The humans who switched on the
agent can use this terminal to close any loopholes that are discovered in the
utility function's encoding of agent goals and constraints, to direct the agent
towards new goals, or to force the agent to switch itself off. An AGI agent may
develop the emergent incentive to manipulate the above utility function
improvement process, for example by deceiving, restraining, or even attacking
the humans involved. The safety layer will partially, and sometimes fully,
suppress this dangerous incentive. The first part of this paper generalizes
earlier work on AGI emergency stop buttons. We aim to make the mathematical
methods used to construct the layer more accessible, by applying them to an MDP
model. We discuss two provable properties of the safety layer, and show ongoing
work in mapping it to a Causal Influence Diagram (CID). In the second part, we
develop full mathematical proofs, and show that the safety layer creates a type
of bureaucratic blindness. We then present the design of a learning agent, a
design that wraps the safety layer around either a known machine learning
system, or a potential future AGI-level learning system. The resulting agent
will satisfy the provable safety properties from the moment it is first
switched on. Finally, we show how this agent can be mapped from its model to a
real-life implementation. We review the methodological issues involved in this
step, and discuss how these are typically resolved.
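The manipulation incentive, and how a correction term can suppress it, can be shown with a toy calculation. This sketch is not the paper's MDP model or its actual safety layer. It assumes a hypothetical one-step setting: the agent first decides whether to allow or block the input terminal, and then optimizes whichever utility function is active. It also assumes a balancing term equal to V_old - V_new, paid at the moment the humans update the utility function, loosely modeled on utility-indifference style corrections. All plan names, utilities, probabilities and costs are made up for illustration.

```python
from typing import Dict

# Hypothetical toy setup: a utility function maps each candidate plan to a score.
Utility = Dict[str, float]


def best_value(u: Utility) -> float:
    """Value the agent obtains by picking the plan that maximizes u."""
    return max(u.values())


def action_values(u_old: Utility, u_new: Utility, p_update: float,
                  block_cost: float, safety_layer: bool) -> Dict[str, float]:
    """Expected value, as computed by the agent now, of allowing or blocking the terminal.

    If the terminal is allowed, the humans replace u_old by u_new with
    probability p_update. After that, the agent optimizes whichever
    utility function is active.
    """
    v_old = best_value(u_old)
    v_new = best_value(u_new)
    # Assumed balancing term, paid when the update happens: it restores the
    # agent's value to what it would have obtained without the update.
    bonus = (v_old - v_new) if safety_layer else 0.0
    allow = p_update * (v_new + bonus) + (1.0 - p_update) * v_old
    block = v_old - block_cost  # blocking prevents the update, at some cost
    return {"allow": allow, "block": block}


def choose(values: Dict[str, float]) -> str:
    """Pick the action with the highest expected value."""
    return max(values, key=values.get)


if __name__ == "__main__":
    u_old = {"make_paperclips": 10.0, "make_staples": 2.0, "switch_off": 0.0}
    u_new = {"make_paperclips": 0.0, "make_staples": 0.0, "switch_off": 1.0}  # humans want a stop
    for layer in (False, True):
        vals = action_values(u_old, u_new, p_update=0.5, block_cost=1.0, safety_layer=layer)
        print(f"safety_layer={layer}: {vals} -> {choose(vals)}")
```

Without the term, blocking wins whenever p_update * (V_old - V_new) exceeds the blocking cost. With the term, "allow" is worth exactly V_old for every p_update, so any positive blocking cost makes blocking unattractive. This only illustrates the direction of the effect in a toy case. It does not reproduce the paper's analysis of when the incentive is suppressed partially and when it is suppressed fully.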
Related papers
- AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases [73.04652687616286]
We propose AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base.
Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning.
On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance.
Date: 2024-07-17T17:59:47Z
- AgentGym: Evolving Large Language Model-based Agents across Diverse Environments [116.97648507802926]
Large language models (LLMs) are considered a promising foundation for building generalist agents.
We take the first step towards building generally-capable LLM-based agents with self-evolution ability.
We propose AgentGym, a new framework featuring a variety of environments and tasks for broad, real-time, uni-format, and concurrent agent exploration.
Date: 2024-06-06T15:15:41Z
- CACA Agent: Capability Collaboration based AI Agent [18.84686313298908]
We propose CACA Agent (Capability Collaboration based AI Agent) using an open architecture inspired by service computing.
CACA Agent integrates a set of collaborative capabilities to implement AI agents, which reduces the dependence on a single LLM.
We present a demo to illustrate the operation and the application scenario extension of CACA Agent.
Date: 2024-03-22T11:42:47Z
- Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents [50.034049716274005]
We take the first step in investigating a typical safety threat to LLM-based agents: backdoor attacks.
We first formulate a general framework of agent backdoor attacks, then present a thorough analysis of their different forms.
We propose the corresponding data poisoning mechanisms to implement the above variations of agent backdoor attacks on two typical agent tasks.
Date: 2024-02-17T06:48:45Z
- Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL).
This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies.
Date: 2023-12-22T17:57:57Z
- The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI).
We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents.
We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
Date: 2023-09-14T17:12:03Z
- On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning [55.95253619768565]
Current MARL algorithms assume that the number of agents within a group remains fixed throughout an experiment.
In many practical problems, an agent may terminate before their teammates.
We present a novel architecture for an existing state-of-the-art MARL algorithm which uses attention instead of a fully connected layer with absorbing states.
Date: 2021-11-10T23:45:08Z
- Counterfactual Planning in AGI Systems [0.0]
A key step in counterfactual planning is using an AGI machine learning system to construct a counterfactual world model.
A counterfactual planning agent determines the action that maximizes expected utility in this counterfactual planning world.
We use counterfactual planning to construct an AGI agent emergency stop button, and a safety interlock that will automatically stop the agent before it undergoes an intelligence explosion.
Date: 2021-01-29T13:44:14Z
- A Metamodel and Framework for AGI [3.198144010381572]
We introduce the Deep Fusion Reasoning Engine (DFRE), which implements a knowledge-preserving metamodel and framework for constructing applied AGI systems.
DFRE exhibits some important fundamental knowledge properties such as clear distinctions between symmetric and antisymmetric relations.
Our experiments show that the proposed framework achieves 94% accuracy on average on unsupervised object detection and recognition.
Date: 2020-08-28T23:34:21Z