Identifying and Addressing Delusions for Target-Directed Decision-Making
- URL: http://arxiv.org/abs/2410.07096v5
- Date: Mon, 18 Nov 2024 17:40:57 GMT
- Title: Identifying and Addressing Delusions for Target-Directed Decision-Making
- Authors: Mingde Zhao, Tristan Sylvain, Doina Precup, Yoshua Bengio
- Abstract summary: We show that target-directed agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes.
We show that these behaviors can result from delusions stemming from improper design choices around training.
We demonstrate how we can make agents address delusions preemptively and autonomously.
- Score: 81.22463009144987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Target-directed agents utilize self-generated targets to guide their behaviors for better generalization. These agents are prone to blindly chasing problematic targets, resulting in worse generalization and safety catastrophes. We show that these behaviors can result from delusions stemming from improper design choices around training: the agent may naturally come to hold false beliefs about certain targets. We identify delusions via intuitive examples in controlled environments, and investigate their causes and mitigations. With these insights, we demonstrate how agents can be made to address delusions preemptively and autonomously. We empirically validate the effectiveness of the proposed strategies in correcting delusional behaviors and improving out-of-distribution generalization.
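As a purely illustrative reading of the abstract (not the paper's actual method), the sketch below shows one way a target-directed agent could screen self-generated targets by estimated reachability before committing, so it does not chase a target it holds false beliefs about; the generator/estimator interfaces and the threshold are hypothetical.

```python
# Minimal illustrative sketch (not the paper's method): screen self-generated
# targets by estimated reachability before committing to one.

def select_target(candidates, feasibility_estimator, threshold=0.5):
    """Pick the most confidently reachable target, or None if all look delusional.

    candidates: target vectors proposed by a (hypothetical) target generator.
    feasibility_estimator: maps a target to an estimated reach probability in [0, 1].
    """
    scored = [(feasibility_estimator(t), t) for t in candidates]
    feasible = [(p, t) for p, t in scored if p >= threshold]
    if not feasible:
        return None  # fall back to default behavior rather than chase a delusion
    return max(feasible, key=lambda pt: pt[0])[1]
```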
Related papers
- Interpreting Emergent Planning in Model-Free Reinforcement Learning [13.820891288919002]
We present the first evidence that model-free reinforcement learning agents can learn to plan.
This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban.
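The methodology rests on concept probing; a minimal linear-probe sketch, under the assumption that (activation, concept-label) pairs have already been collected from the Sokoban agent, might look as follows (the dataset and labels are hypothetical):

```python
# Minimal linear-probe sketch in the spirit of concept-based interpretability.
# The (activation, concept_label) pairs are assumed pre-collected from a
# Sokoban agent; the dataset here is hypothetical.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_concept(activations, concept_labels, seed=0):
    """Return held-out accuracy of a linear probe decoding the concept."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, concept_labels, test_size=0.25, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)  # high accuracy = concept linearly decodable
```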
arXiv Detail & Related papers (2025-04-02T16:24:23Z)
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs).
This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld.
We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
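Since true criticality is defined as an expectation, it lends itself to a Monte Carlo estimate. Below is a hedged sketch under the Gymnasium environment convention; the env_fn/policy interfaces and the rollout budget are assumptions, not the paper's implementation.

```python
# Monte Carlo sketch of true criticality as defined above: the expected drop
# in return when the agent takes n consecutive random actions instead of
# following its policy.
import numpy as np

def estimate_criticality(env_fn, policy, seed, n, num_rollouts=100):
    def rollout(deviate):
        env = env_fn()
        obs, _ = env.reset(seed=seed)  # start from the state under study
        total, steps, done = 0.0, 0, False
        while not done:
            if deviate and steps < n:
                action = env.action_space.sample()  # random deviation
            else:
                action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
            steps += 1
        return total

    on_policy = np.mean([rollout(False) for _ in range(num_rollouts)])
    deviated = np.mean([rollout(True) for _ in range(num_rollouts)])
    return on_policy - deviated  # expected drop in return
```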
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Towards Transferable Attacks Against Vision-LLMs in Autonomous Driving with Typography [21.632703081999036]
Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems.
We propose to leverage typographic attacks against AD systems relying on the decision-making capabilities of Vision-LLMs.
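As a toy illustration of the attack surface (not the paper's optimized typographic patterns), injecting misleading text into a scene image before it reaches a Vision-LLM can be as simple as:

```python
# Toy illustration only: paste misleading text into a scene image before it
# reaches the Vision-LLM. The text and placement are arbitrary.
from PIL import Image, ImageDraw

def typographic_attack(image_path, text="SPEED LIMIT 120", xy=(20, 20)):
    img = Image.open(image_path).convert("RGB")
    ImageDraw.Draw(img).text(xy, text, fill=(255, 255, 255))
    return img  # feed this to the Vision-LLM under evaluation
```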
arXiv Detail & Related papers (2024-05-23T04:52:02Z)
- Deception in Reinforced Autonomous Agents [30.510998478048723]
We explore the ability of large language model (LLM)-based agents to engage in subtle deception.
This behavior can be hard to detect, unlike blatant lying or unintentional hallucination.
We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles.
arXiv Detail & Related papers (2024-05-07T13:55:11Z)
- Analyzing Intentional Behavior in Autonomous Agents under Uncertainty [3.0099979365586265]
Principled accountability for autonomous decision-making in uncertain environments requires distinguishing intentional outcomes from negligent designs and from actual accidents.
We propose analyzing the behavior of autonomous agents through a quantitative measure of the evidence of intentional behavior.
In a case study, we show how our method can distinguish between 'intentional' and 'accidental' traffic collisions.
arXiv Detail & Related papers (2023-07-04T07:36:11Z)
- Power-seeking can be probable and predictive for trained agents [3.616948583169635]
Power-seeking behavior is a key source of risk from advanced AI.
We investigate how the training process affects power-seeking incentives.
We show that power-seeking incentives can be probable and predictive.
arXiv Detail & Related papers (2023-04-13T13:29:01Z)
- Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
We experimentally demonstrate improved expected return on out-of-distribution goals, while still allowing goals to be specified with expressive structure.
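A discretizing bottleneck of this kind can be illustrated with nearest-neighbor vector quantization; in the sketch below the codebook is random for illustration, whereas in practice it would be learned jointly with the goal encoder:

```python
# Nearest-neighbor vector quantization as a stand-in for the discretizing
# bottleneck; the codebook is random here purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))  # 64 discrete codes of dimension 16

def discretize(goal_embedding):
    """Snap a continuous goal embedding to its nearest codebook entry."""
    dists = np.linalg.norm(codebook - goal_embedding, axis=1)
    return codebook[np.argmin(dists)]
```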
arXiv Detail & Related papers (2022-11-01T03:31:43Z)
- Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models [48.93128542994217]
We propose an imitation adversarial attack on black-box neural passage ranking models.
We show that the target passage ranking model can be made transparent and imitated by enumerating critical queries/candidates.
We also propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers.
arXiv Detail & Related papers (2022-09-14T09:10:07Z)
- A Tale of HodgeRank and Spectral Method: Target Attack Against Rank Aggregation Is the Fixed Point of Adversarial Game [153.74942025516853]
The intrinsic vulnerability of the rank aggregation methods is not well studied in the literature.
In this paper, we focus on a purposeful adversary who seeks to dictate the aggregated results by modifying the pairwise data.
The effectiveness of the suggested target attack strategies is demonstrated by a series of toy simulations and several real-world data experiments.
arXiv Detail & Related papers (2022-09-13T05:59:02Z)
- On Almost-Sure Intention Deception Planning that Exploits Imperfect Observers [24.11353445650682]
Intention deception involves computing a strategy which deceives the opponent into a wrong belief about the agent's intention or objective.
This paper studies a class of probabilistic planning problems with intention deception and investigates how a defender's limited sensing modality can be exploited.
arXiv Detail & Related papers (2022-09-01T16:38:03Z)
- Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
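That trade-off can be pictured as a proxy reward augmented with a term for the agent's preserved ability on auxiliary future tasks; the sketch below is a stand-in under that assumption (the auxiliary value functions and the weight lam are hypothetical, not the paper's exact formulation):

```python
# Stand-in for the trade-off described above: proxy reward plus a term for
# the agent's preserved ability on auxiliary future tasks.
def regularized_reward(proxy_reward, state, auxiliary_value_fns, lam=0.1):
    # auxiliary_value_fns: hypothetical value functions V_task(state), one per task
    ability = sum(v(state) for v in auxiliary_value_fns) / len(auxiliary_value_fns)
    return proxy_reward + lam * ability  # side effects lower future ability
```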
arXiv Detail & Related papers (2022-06-23T16:36:13Z)
- Path-Specific Objectives for Safer Agent Incentives [15.759504531768219]
We describe settings with 'delicate' parts of the state which should not be used as a means to an end.
We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state.
The resulting agents have no incentive to control the delicate state.
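One way to picture the objective: evaluate an action in a model where the delicate part of the state is pinned to its no-intervention baseline, so any effect routed through it is cut. The state split and model interface below are hypothetical:

```python
# Illustrative path-specific evaluation: pin the delicate part of the state to
# its baseline so effects mediated through it are cut.
def path_specific_return(model, state, action, delicate_baseline):
    counterfactual = dict(state)
    counterfactual["delicate"] = delicate_baseline  # block the mediated path
    return model.predict_return(counterfactual, action)  # hypothetical interface
```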
arXiv Detail & Related papers (2022-04-21T11:01:31Z)
- Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns [18.694795507945603]
Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks.
This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical objects in the environment.
arXiv Detail & Related papers (2021-09-16T04:59:06Z)
- Deceptive Decision-Making Under Uncertainty [25.197098169762356]
We study the design of autonomous agents that are capable of deceiving outside observers about their intentions while carrying out tasks.
By modeling the agent's behavior as a Markov decision process, we consider a setting where the agent aims to reach one of multiple potential goals.
We propose a novel approach to model observer predictions based on the principle of maximum entropy and to efficiently generate deceptive strategies.
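A maximum-entropy observer model of this kind can be sketched as a softmax belief over candidate goals, scored by how well the observed trajectory progresses toward each; the progress function and temperature below are assumptions:

```python
# Maximum-entropy observer sketch: belief over candidate goals proportional to
# exp(beta * progress). The progress function and beta are assumptions.
import numpy as np

def observer_belief(trajectory, goals, progress, beta=1.0):
    logits = beta * np.array([progress(trajectory, g) for g in goals])
    belief = np.exp(logits - logits.max())  # subtract max for stability
    return belief / belief.sum()  # probability the observer assigns each goal
```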
arXiv Detail & Related papers (2021-09-14T14:56:23Z)
- Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto standard and most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z) - Guided Adversarial Attack for Evaluating and Enhancing Adversarial
Defenses [59.58128343334556]
We introduce a relaxation term to the standard loss, that finds more suitable gradient-directions, increases attack efficacy and leads to more efficient adversarial training.
We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries.
We also propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses.
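A hedged PyTorch-style sketch of an objective in this spirit: a margin attack loss plus an L2 relaxation term tying the adversarial logits to the clean image's function mapping (in GAMA the relaxation weight is scheduled; the fixed lam and the (batch, classes) shapes here are assumptions):

```python
# Sketch in the spirit of GAMA: margin attack loss plus an L2 relaxation
# tying adversarial logits to the clean image's function mapping.
import torch
import torch.nn.functional as F

def gama_style_loss(model, x_adv, x_clean, y, lam=10.0):
    logits_adv = model(x_adv)
    logits_clean = model(x_clean).detach()
    true_logit = logits_adv.gather(1, y.unsqueeze(1)).squeeze(1)
    mask = F.one_hot(y, logits_adv.size(1)).bool()
    best_other = logits_adv.masked_fill(mask, float("-inf")).max(dim=1).values
    margin = true_logit - best_other          # attacker drives this negative
    relax = ((logits_adv - logits_clean) ** 2).sum(dim=1)
    return (margin + lam * relax).mean()      # minimized w.r.t. x_adv
```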
arXiv Detail & Related papers (2020-11-30T16:39:39Z)
- Intrinsic Motivation for Encouraging Synergistic Behavior [55.10275467562764]
We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks.
Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own.
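That guiding principle can be turned into an intrinsic bonus by comparing a joint-action forward model against composed single-agent predictions; both model interfaces below are assumptions for illustration:

```python
# Intrinsic bonus sketch: reward joint effects the agents could not produce
# alone. f_joint and f_single are hypothetical learned forward models, with
# f_single(state, agent_index, action) predicting the next state when only
# one agent acts.
import numpy as np

def synergy_bonus(f_joint, f_single, state, a1, a2):
    joint_next = f_joint(state, a1, a2)                  # both act at once
    solo_next = f_single(f_single(state, 0, a1), 1, a2)  # one agent at a time
    return np.linalg.norm(joint_next - solo_next)        # larger = more synergy
```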
arXiv Detail & Related papers (2020-02-12T19:34:51Z)
- Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior.
As the trained policy learns to be more successful, the negative examples become increasingly similar to expert ones.
We propose a method to alleviate the impact of false negatives and test it on the BabyAI environment.
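One hedged reading of the mitigation: episodes where the agent actually succeeded should not be handed to the discriminator as negatives. The episode structure and success flag below are illustrative:

```python
# Illustrative batching that keeps successful agent episodes out of the
# negative set; episode dicts and the "success" flag are assumptions.
def discriminator_batches(agent_episodes, expert_episodes):
    positives = list(expert_episodes)
    negatives = []
    for ep in agent_episodes:
        (positives if ep["success"] else negatives).append(ep)
    return positives, negatives
```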
arXiv Detail & Related papers (2020-02-02T14:56:39Z)