Rejecting Hallucinated State Targets during Planning
- URL: http://arxiv.org/abs/2410.07096v8
- Date: Wed, 04 Jun 2025 20:43:13 GMT
- Title: Rejecting Hallucinated State Targets during Planning
- Authors: Mingde Zhao, Tristan Sylvain, Romain Laroche, Doina Precup, Yoshua Bengio
- Abstract summary: This work first categorizes and investigates the properties of several kinds of infeasible targets. We devise a strategy to reject infeasible targets with a generic target evaluator. We highlight that, without proper design, the evaluator can produce delusional estimates, rendering the strategy futile.
- Score: 84.179112256683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models can be used in planning to propose targets corresponding to states that agents deem either likely or advantageous to experience. However, imperfections, common in learned models, lead to infeasible hallucinated targets, which can cause delusional behaviors and thus safety concerns. This work first categorizes and investigates the properties of several kinds of infeasible targets. Then, we devise a strategy to reject infeasible targets with a generic target evaluator, which trains alongside planning agents as an add-on without the need to change the behaviors or the architectures of the agent (and the generative model) it is attached to. We highlight that, without proper design, the evaluator can produce delusional estimates, rendering the strategy futile. Thus, to learn correct evaluations of infeasible targets, we propose a combination of a learning rule, an architecture, and two assistive hindsight relabeling strategies. Our experiments validate significant reductions in delusional behaviors and performance improvements for several kinds of existing planning agents.
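The following is a minimal, self-contained PyTorch sketch of the kind of add-on target evaluator the abstract describes: a learned feasibility score over (current state, proposed target) pairs that is trained alongside the agent and used to reject hallucinated targets at planning time. The module names, the binary feasibility objective, and the use of unreached proposals as negative examples are illustrative assumptions, not details taken from the paper.
```python
# Hedged sketch of a generic target evaluator used to reject infeasible
# (hallucinated) targets proposed by a generative model during planning.
# All names and the training objective below are assumptions for illustration.
import torch
import torch.nn as nn

class TargetEvaluator(nn.Module):
    """Scores (state, target) pairs; a low score marks the target as likely infeasible."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, target], dim=-1)).squeeze(-1)

def hindsight_relabel(trajectory):
    """Treat every state visited later in a rollout as a feasible target for earlier states."""
    states, targets = [], []
    for i, s in enumerate(trajectory[:-1]):
        for t in trajectory[i + 1:]:
            states.append(s)
            targets.append(t)
    return torch.stack(states), torch.stack(targets)

def reject_infeasible(evaluator, state, proposed_targets, threshold=0.0):
    """Keep only generated targets whose feasibility logit clears the threshold."""
    with torch.no_grad():
        scores = evaluator(state.expand(len(proposed_targets), -1), proposed_targets)
    return proposed_targets[scores > threshold]

# One illustrative training step: hindsight-relabeled pairs as positives,
# generated-but-never-reached targets (here random stand-ins) as negatives.
state_dim = 4
evaluator = TargetEvaluator(state_dim)
optimizer = torch.optim.Adam(evaluator.parameters(), lr=1e-3)

rollout = [torch.randn(state_dim) for _ in range(6)]   # placeholder trajectory
pos_s, pos_t = hindsight_relabel(rollout)              # feasible by construction
neg_s, neg_t = pos_s, torch.randn_like(pos_t)          # stand-in infeasible proposals

logits = torch.cat([evaluator(pos_s, pos_t), evaluator(neg_s, neg_t)])
labels = torch.cat([torch.ones(len(pos_t)), torch.zeros(len(neg_t))])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# At planning time, proposals from the generative model are filtered before use.
proposals = torch.randn(8, state_dim)
accepted = reject_infeasible(evaluator, rollout[-1], proposals)
```
Treating rejection as a separate, learned gate keeps the agent and its generative model untouched, which matches the abstract's framing of the evaluator as an add-on rather than a change to either component.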
Related papers
- Exploiting Edge Features for Transferable Adversarial Attacks in Distributed Machine Learning [54.26807397329468]
This work explores a previously overlooked vulnerability in distributed deep learning systems.
An adversary who intercepts the intermediate features transmitted between them can still pose a serious threat.
We propose an exploitation strategy specifically designed for distributed settings.
arXiv Detail & Related papers (2025-07-09T20:09:00Z) - Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach [26.259289475583522]
Multi-target adversarial attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously.
To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks.
We propose the 2D-TGAF framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors.
arXiv Detail & Related papers (2025-04-19T02:08:48Z) - Interpreting Emergent Planning in Model-Free Reinforcement Learning [13.820891288919002]
We present the first evidence that model-free reinforcement learning agents can learn to plan.
This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban.
arXiv Detail & Related papers (2025-04-02T16:24:23Z) - GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs).
This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld.
We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Towards Transferable Attacks Against Vision-LLMs in Autonomous Driving with Typography [21.632703081999036]
Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems.
We propose to leverage typographic attacks against AD systems relying on the decision-making capabilities of Vision-LLMs.
arXiv Detail & Related papers (2024-05-23T04:52:02Z) - Deception in Reinforced Autonomous Agents [30.510998478048723]
We explore the ability of large language model (LLM)-based agents to engage in subtle deception.
This behavior can be hard to detect, unlike blatant lying or unintentional hallucination.
We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles.
arXiv Detail & Related papers (2024-05-07T13:55:11Z) - Analyzing Intentional Behavior in Autonomous Agents under Uncertainty [3.0099979365586265]
Principled accountability for autonomous decision-making in uncertain environments requires distinguishing intentional outcomes from negligent designs from actual accidents.
We propose analyzing the behavior of autonomous agents through a quantitative measure of the evidence of intentional behavior.
In a case study, we show how our method can distinguish between 'intentional' and 'accidental' traffic collisions.
arXiv Detail & Related papers (2023-07-04T07:36:11Z) - Power-seeking can be probable and predictive for trained agents [3.616948583169635]
Power-seeking behavior is a key source of risk from advanced AI.
We investigate how the training process affects power-seeking incentives.
We show that power-seeking incentives can be probable and predictive.
arXiv Detail & Related papers (2023-04-13T13:29:01Z) - Towards Reasonable Budget Allocation in Untargeted Graph Structure Attacks via Gradient Debias [50.628150015907565]
The cross-entropy loss function is used to evaluate perturbation schemes in classification tasks.
Previous methods use the negative cross-entropy loss as the attack objective when attacking node-level classification models.
This paper argues that this attack objective is unreasonable from the perspective of budget allocation.
arXiv Detail & Related papers (2023-03-29T13:02:02Z) - Learning to Generate All Feasible Actions [4.333208181196761]
We introduce action mapping, a novel approach that divides the learning process into two steps: first learn feasibility and subsequently, the objective.
This paper focuses on the feasibility part by learning to generate all feasible actions through self-supervised querying of the feasibility model.
We demonstrate the agent's proficiency in generating actions across disconnected feasible action sets.
arXiv Detail & Related papers (2023-01-26T23:15:51Z) - Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
We experimentally confirm improved expected return on out-of-distribution goals, while still allowing for specifying goals with expressive structure.
arXiv Detail & Related papers (2022-11-01T03:31:43Z) - Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models [48.93128542994217]
We propose an imitation adversarial attack on black-box neural passage ranking models.
We show that the target passage ranking model can be transparentized and imitated by enumerating critical queries/candidates.
We also propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers.
arXiv Detail & Related papers (2022-09-14T09:10:07Z) - A Tale of HodgeRank and Spectral Method: Target Attack Against Rank Aggregation Is the Fixed Point of Adversarial Game [153.74942025516853]
The intrinsic vulnerability of the rank aggregation methods is not well studied in the literature.
In this paper, we focus on the purposeful adversary who desires to designate the aggregated results by modifying the pairwise data.
The effectiveness of the suggested target attack strategies is demonstrated by a series of toy simulations and several real-world data experiments.
arXiv Detail & Related papers (2022-09-13T05:59:02Z) - On Almost-Sure Intention Deception Planning that Exploits Imperfect Observers [24.11353445650682]
Intention deception involves computing a strategy which deceives the opponent into a wrong belief about the agent's intention or objective.
This paper studies a class of probabilistic planning problems with intention deception and investigates how a defender's limited sensing modality can be exploited.
arXiv Detail & Related papers (2022-09-01T16:38:03Z) - Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z) - Path-Specific Objectives for Safer Agent Incentives [15.759504531768219]
We describe settings with 'delicate' parts of the state which should not be used as a means to an end.
We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state.
The resulting agents have no incentive to control the delicate state.
arXiv Detail & Related papers (2022-04-21T11:01:31Z) - Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns [18.694795507945603]
Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks.
This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical objects in the environment.
arXiv Detail & Related papers (2021-09-16T04:59:06Z) - Deceptive Decision-Making Under Uncertainty [25.197098169762356]
We study the design of autonomous agents that are capable of deceiving outside observers about their intentions while carrying out tasks.
By modeling the agent's behavior as a Markov decision process, we consider a setting where the agent aims to reach one of multiple potential goals.
We propose a novel approach to model observer predictions based on the principle of maximum entropy and to efficiently generate deceptive strategies.
arXiv Detail & Related papers (2021-09-14T14:56:23Z) - Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z) - SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning [24.663586662594703]
Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains.
Reinforcement learning approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards.
We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed to accomplish goals that are initially unreachable for the agent.
arXiv Detail & Related papers (2020-12-24T00:31:02Z) - Guided Adversarial Attack for Evaluating and Enhancing Adversarial Defenses [59.58128343334556]
We introduce a relaxation term to the standard loss, that finds more suitable gradient-directions, increases attack efficacy and leads to more efficient adversarial training.
We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries.
We also propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses.
arXiv Detail & Related papers (2020-11-30T16:39:39Z) - Forethought and Hindsight in Credit Assignment [62.05690959741223]
We work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models.
We investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated.
arXiv Detail & Related papers (2020-10-26T16:00:47Z) - On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward.
Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z) - Online Bayesian Goal Inference for Boundedly-Rational Planning Agents [46.60073262357339]
We present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions.
Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning.
We develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models.
arXiv Detail & Related papers (2020-06-13T01:48:10Z) - Intrinsic Motivation for Encouraging Synergistic Behavior [55.10275467562764]
We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks.
Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own.
arXiv Detail & Related papers (2020-02-12T19:34:51Z) - Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior.
As the trained policy learns to be more successful, the negative examples become increasingly similar to expert ones.
We propose a method to alleviate the impact of false negatives and test it on the BabyAI environment.
arXiv Detail & Related papers (2020-02-02T14:56:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.