Identifying and Addressing Delusions for Target-Directed Decision-Making
- URL: http://arxiv.org/abs/2410.07096v4
- Date: Wed, 16 Oct 2024 18:46:35 GMT
- Title: Identifying and Addressing Delusions for Target-Directed Decision-Making
- Authors: Mingde Zhao, Tristan Sylvain, Doina Precup, Yoshua Bengio
- Abstract summary: We show that target-directed agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes.
We identify different types of delusions via intuitive examples in controlled environments, and investigate their causes and mitigations.
We empirically validate the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
- Score: 81.22463009144987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Target-directed agents use self-generated targets to guide their behaviors toward better generalization. However, these agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can result from delusions stemming from improper designs around training: the agent may naturally come to hold false beliefs about certain targets. We identify different types of delusions via intuitive examples in controlled environments and investigate their causes and mitigations. With these insights, we demonstrate how agents can be made to address delusions preemptively and autonomously. We empirically validate the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
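As a concrete illustration of the failure mode the abstract describes, here is a minimal hypothetical sketch (mine, not the paper's code): a reachability estimator trained only on targets the agent actually achieved never sees negative examples, so it stays optimistic on out-of-distribution targets, and the agent blindly commits to one. All names and numbers are illustrative assumptions.

```python
# Toy sketch of a "delusion": the reachability estimator was trained only on
# achieved (hindsight-relabeled) targets, so it has no negative examples and
# remains optimistic on targets it has never seen.
import random

ACHIEVED_TARGETS = {(0, 0), (0, 1), (1, 0), (1, 1)}  # training distribution

def estimated_reachability(target):
    # Caricature of optimism learned from positives only: high confidence
    # everywhere, including targets that were never actually reachable.
    return 0.9

def generate_candidate_targets():
    # The target generator is free to propose out-of-distribution targets.
    return [(random.randint(0, 3), random.randint(0, 3)) for _ in range(5)]

candidates = generate_candidate_targets()
chosen = max(candidates, key=estimated_reachability)
if chosen not in ACHIEVED_TARGETS:
    print(f"delusional commitment: {chosen} was never achieved in training")
```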
Related papers
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
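Read literally, the true-criticality definition lends itself to a Monte Carlo estimate. The sketch below is my own reading under stated assumptions: `env.clone()`, `env.actions()`, and the `(obs, reward, done)` step signature are hypothetical stand-ins, and a deterministic environment is assumed for the on-policy baseline.

```python
# Hedged sketch: true criticality of the current state as the expected return
# drop when the agent takes n consecutive random actions instead of following
# its policy, then resumes the policy.
import random

def rollout(env, policy, horizon):
    total = 0.0
    for _ in range(horizon):
        obs, reward, done = env.step(policy(env.observation()))
        total += reward
        if done:
            break
    return total

def true_criticality(env, policy, n, horizon, samples=100):
    baseline = rollout(env.clone(), policy, horizon)  # on-policy return
    drops = []
    for _ in range(samples):
        e = env.clone()
        deviated = 0.0
        for _ in range(n):                            # deviate: n random actions
            _, r, done = e.step(random.choice(e.actions()))
            deviated += r
            if done:
                break
        deviated += rollout(e, policy, horizon - n)   # resume the policy
        drops.append(baseline - deviated)
    return sum(drops) / len(drops)                    # expected reward drop
```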
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Analyzing Intentional Behavior in Autonomous Agents under Uncertainty [3.0099979365586265]
Principled accountability for autonomous decision-making in uncertain environments requires distinguishing intentional outcomes from negligent designs and from actual accidents.
We propose analyzing the behavior of autonomous agents through a quantitative measure of the evidence of intentional behavior.
In a case study, we show how our method can distinguish between 'intentional' and 'accidental' traffic collisions.
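One way such a measure could be operationalized, purely as an assumption on my part rather than the paper's definition: compare the outcome's probability under the agent's actual policy against a counterfactual policy that tries to avoid it.

```python
# Hypothetical evidence ratio for intentional behavior (illustrative only).
def intention_evidence(p_outcome_actual, p_outcome_avoiding, eps=1e-9):
    # >> 1: the outcome tracked the agent's choices (looks intentional);
    # ~= 1: the outcome was hard to avoid regardless (looks accidental).
    return p_outcome_actual / max(p_outcome_avoiding, eps)
```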
arXiv Detail & Related papers (2023-07-04T07:36:11Z)
- Power-seeking can be probable and predictive for trained agents [3.616948583169635]
Power-seeking behavior is a key source of risk from advanced AI.
We investigate how the training process affects power-seeking incentives.
We show that power-seeking incentives can be probable and predictive.
arXiv Detail & Related papers (2023-04-13T13:29:01Z)
- Emergent Behaviors in Multi-Agent Target Acquisition [0.0]
We simulate a Multi-Agent System (MAS) using Reinforcement Learning (RL) in a pursuit-evasion game.
We create different adversarial scenarios by replacing RL-trained pursuers' policies with two distinct (non-RL) analytical strategies.
The novelty of our approach lies in creating an influential feature set that reveals underlying regularities in the data.
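For flavor, here are two textbook analytical pursuit strategies of the kind such studies substitute for learned policies; the specific choices below are my assumption, not necessarily the paper's.

```python
# Two simple non-RL pursuit controllers (illustrative, not the paper's).
import numpy as np

def pure_pursuit(pursuer_pos, evader_pos, speed):
    # Head straight at the evader's current position.
    d = evader_pos - pursuer_pos
    return speed * d / np.linalg.norm(d)

def lead_pursuit(pursuer_pos, evader_pos, evader_vel, speed, lookahead=1.0):
    # Aim where the evader will be `lookahead` seconds from now.
    d = (evader_pos + lookahead * evader_vel) - pursuer_pos
    return speed * d / np.linalg.norm(d)
```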
arXiv Detail & Related papers (2022-12-15T15:20:58Z)
- Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process (POMDP).
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
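The trade-off reads like a penalty on lost option value. A minimal sketch of that idea under my assumptions (the helper names and the linear penalty form are illustrative, not the paper's formulation):

```python
# Proxy reward minus a penalty for reduced ability to do future tasks.
def shaped_reward(proxy_reward, future_task_values, baseline_values, lam=0.1):
    # How much value for each auxiliary future task dropped relative to a
    # baseline (e.g., no-op) behavior; only losses are penalized.
    ability_loss = sum(
        max(0.0, b - v) for v, b in zip(future_task_values, baseline_values)
    )
    return proxy_reward - lam * ability_loss
```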
arXiv Detail & Related papers (2022-06-23T16:36:13Z)
- Path-Specific Objectives for Safer Agent Incentives [15.759504531768219]
We describe settings with 'delicate' parts of the state which should not be used as a means to an end.
We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state.
The resulting agents have no incentive to control the delicate state.
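A minimal sketch of how a path-specific return could be evaluated, assuming a `simulate` helper that can pin the delicate state to a recorded trajectory (both the helper and its signature are my assumptions):

```python
# Return attributable to the action only along paths NOT mediated by the
# delicate state: the delicate state is forced to evolve as under a baseline.
def path_specific_return(simulate, action, baseline_action):
    # Roll out the baseline once to record the delicate-state trajectory.
    _, delicate_traj = simulate(baseline_action, delicate_override=None)
    # Re-simulate the actual action with the delicate state pinned to that
    # trajectory, cutting any effect that flows through it.
    ret, _ = simulate(action, delicate_override=delicate_traj)
    return ret
```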
arXiv Detail & Related papers (2022-04-21T11:01:31Z)
- Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns [18.694795507945603]
Recent studies have demonstrated the vulnerability of control policies learned through deep reinforcement learning to adversarial attacks.
This paper investigates the feasibility of targeted attacks through visually learned patterns placed on a physical object in the environment.
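A generic adversarial-patch loop in the spirit of such attacks (my sketch, not the paper's method; `render` and `policy` are assumed differentiable stand-ins):

```python
# Optimize a physical pattern so the driving policy's action drifts toward
# an attacker-chosen target action.
import torch

def optimize_patch(policy, render, target_action, steps=200, lr=0.01):
    patch = torch.rand(3, 32, 32, requires_grad=True)  # pattern texture
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        obs = render(patch)                    # scene with the patch placed
        action = policy(obs)
        loss = ((action - target_action) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            patch.clamp_(0, 1)                 # keep pixels printable
    return patch.detach()
```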
arXiv Detail & Related papers (2021-09-16T04:59:06Z)
- End-to-End Learning and Intervention in Games [60.41921763076017]
We provide a unified framework for learning and intervention in games.
We propose two approaches, respectively based on explicit and implicit differentiation.
The analytical results are validated using several real-world problems.
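The implicit route typically leans on the implicit function theorem: if the players' equilibrium x*(θ) satisfies F(x*, θ) = 0, then dx*/dθ = -(∂F/∂x)^(-1) ∂F/∂θ, so gradients of an intervention parameter θ can flow through the equilibrium without unrolling the solver. A generic sketch under that assumption (not the paper's code):

```python
# Differentiate an equilibrium x* (with F(x*, theta) = 0) w.r.t. theta.
import torch

def equilibrium_grad(F, x_star, theta):
    # Jacobians of the fixed-point residual at the solved equilibrium;
    # x_star and theta are 1-D tensors here.
    dF_dx = torch.autograd.functional.jacobian(lambda x: F(x, theta), x_star)
    dF_dth = torch.autograd.functional.jacobian(lambda t: F(x_star, t), theta)
    return -torch.linalg.solve(dF_dx, dF_dth)  # dx*/dtheta
```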
arXiv Detail & Related papers (2020-10-26T18:39:32Z)
- Intrinsic Motivation for Encouraging Synergistic Behavior [55.10275467562764]
We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks.
Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own.
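That principle suggests an intrinsic reward shaped like a prediction gap. A hedged sketch under my assumptions (`f_joint` and `f_solo` as learned dynamics models; the norm-of-difference form is illustrative):

```python
# Bonus for joint effects that composing solo effects would not produce.
import numpy as np

def synergy_bonus(f_joint, f_solo, state, a1, a2):
    joint_next = f_joint(state, a1, a2)           # predicted joint effect
    composed = f_solo(f_solo(state, a1), a2)      # agents acting on their own
    return np.linalg.norm(joint_next - composed)  # large gap => synergy
```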
arXiv Detail & Related papers (2020-02-12T19:34:51Z)
- Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior.
As the trained policy learns to be more successful, the negative examples become increasingly similar to expert ones.
We propose a method to alleviate the impact of false negatives and test it on the BabyAI environment.
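One plausible shape of such a mitigation, as an assumption rather than the paper's exact method: episodes where the agent actually succeeded are relabeled as expert-like before the discriminator update, so they stop acting as false negatives.

```python
# Relabel successful agent episodes as positives for the discriminator.
def discriminator_batch(agent_episodes, expert_episodes, succeeded):
    positives = list(expert_episodes)
    negatives = []
    for ep, ok in zip(agent_episodes, succeeded):
        (positives if ok else negatives).append(ep)  # success => expert-like
    return positives, negatives
```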
arXiv Detail & Related papers (2020-02-02T14:56:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.