Related papers: Identifying and Addressing Delusions for Target-Directed Decision-Making

Related papers

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.<n>We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.<n>This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
Current Agents Fail to Leverage World Model as Tool for Foresight [61.82522354207919]
Generative world models offer a promising remedy: agents could use them to foresee outcomes before acting.<n>This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition.
arXiv Detail & Related papers (2026-01-07T13:15:23Z)
Brain-Inspired Planning for Better Generalization in Reinforcement Learning [0.0]
This thesis explores the direction of enhancing agents' zero-shot systematic generalization abilities.<n>We introduce a top-down attention mechanism, which allows a decision-time planning agent to dynamically focus its reasoning on the most relevant aspects of the environmental state.<n>We also developed the Skipper framework to automatically decompose complex tasks into simpler, more manageable sub-tasks.
arXiv Detail & Related papers (2025-11-09T17:32:55Z)
Exploiting Edge Features for Transferable Adversarial Attacks in Distributed Machine Learning [54.26807397329468]
This work explores a previously overlooked vulnerability in distributed deep learning systems.<n>An adversary who intercepts the intermediate features transmitted between them can still pose a serious threat.<n>We propose an exploitation strategy specifically designed for distributed settings.
arXiv Detail & Related papers (2025-07-09T20:09:00Z)
AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
Large Language Model (LLM) agents become more widespread, associated misalignment risks increase.<n>In this work, we approach misalignment as a conflict between the internal goals pursued by the model and the goals intended by its deployer.<n>We introduce a misalignment propensity benchmark, textscAgentMisalignment, a benchmark suite designed to evaluate the propensity of LLM agents to misalign in realistic scenarios.
arXiv Detail & Related papers (2025-06-04T14:46:47Z)
Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach [26.259289475583522]
Multi-target adversarial attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously.<n>To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks.<n>We propose the 2D-TGAF framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors.
arXiv Detail & Related papers (2025-04-19T02:08:48Z)
Interpreting Emergent Planning in Model-Free Reinforcement Learning [13.820891288919002]
We present the first evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban.
arXiv Detail & Related papers (2025-04-02T16:24:23Z)
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs) This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
Towards Transferable Attacks Against Vision-LLMs in Autonomous Driving with Typography [21.632703081999036]
Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems. We propose to leverage typographic attacks against AD systems relying on the decision-making capabilities of Vision-LLMs.
arXiv Detail & Related papers (2024-05-23T04:52:02Z)
Deception in Reinforced Autonomous Agents [30.510998478048723]
We explore the ability of large language model (LLM)-based agents to engage in subtle deception. This behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build an adversarial testbed mimicking a legislative environment where two LLMs play opposing roles.
arXiv Detail & Related papers (2024-05-07T13:55:11Z)
Analyzing Intentional Behavior in Autonomous Agents under Uncertainty [3.0099979365586265]
Principled accountability for autonomous decision-making in uncertain environments requires distinguishing intentional outcomes from negligent designs from actual accidents. We propose analyzing the behavior of autonomous agents through a quantitative measure of the evidence of intentional behavior. In a case study, we show how our method can distinguish between 'intentional' and 'accidental' traffic collisions.
arXiv Detail & Related papers (2023-07-04T07:36:11Z)
Power-seeking can be probable and predictive for trained agents [3.616948583169635]
Power-seeking behavior is a key source of risk from advanced AI. We investigate how the training process affects power-seeking incentives. We show that power-seeking incentives can be probable and predictive.
arXiv Detail & Related papers (2023-04-13T13:29:01Z)
Towards Reasonable Budget Allocation in Untargeted Graph Structure Attacks via Gradient Debias [50.628150015907565]
Cross-entropy loss function is used to evaluate perturbation schemes in classification tasks. Previous methods use negative cross-entropy loss as the attack objective in attacking node-level classification models. This paper argues about the previous unreasonable attack objective from the perspective of budget allocation.
arXiv Detail & Related papers (2023-03-29T13:02:02Z)
Learning to Generate All Feasible Actions [4.333208181196761]
We introduce action mapping, a novel approach that divides the learning process into two steps: first learn feasibility and subsequently, the objective. This paper focuses on the feasibility part by learning to generate all feasible actions through self-supervised querying of the feasibility model. We demonstrate the agent's proficiency in generating actions across disconnected feasible action sets.
arXiv Detail & Related papers (2023-01-26T23:15:51Z)
Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups. We experimentally prove the expected return on out-of-distribution goals, while still allowing for specifying goals with expressive structure.
arXiv Detail & Related papers (2022-11-01T03:31:43Z)
Order-Disorder: Imitation Adversarial Attacks for Black-box Neural Ranking Models [48.93128542994217]
We propose an imitation adversarial attack on black-box neural passage ranking models. We show that the target passage ranking model can be transparentized and imitated by enumerating critical queries/candidates. We also propose an innovative gradient-based attack method, empowered by the pairwise objective function, to generate adversarial triggers.
arXiv Detail & Related papers (2022-09-14T09:10:07Z)
A Tale of HodgeRank and Spectral Method: Target Attack Against Rank Aggregation Is the Fixed Point of Adversarial Game [153.74942025516853]
The intrinsic vulnerability of the rank aggregation methods is not well studied in the literature. In this paper, we focus on the purposeful adversary who desires to designate the aggregated results by modifying the pairwise data. The effectiveness of the suggested target attack strategies is demonstrated by a series of toy simulations and several real-world data experiments.
arXiv Detail & Related papers (2022-09-13T05:59:02Z)
On Almost-Sure Intention Deception Planning that Exploits Imperfect Observers [24.11353445650682]
Intention deception involves computing a strategy which deceives the opponent into a wrong belief about the agent's intention or objective. This paper studies a class of probabilistic planning problems with intention deception and investigates how a defender's limited sensing modality can be exploited.
arXiv Detail & Related papers (2022-09-01T16:38:03Z)
Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework. In these games, the agent solves a partially observable Markov decision process. We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z)
Path-Specific Objectives for Safer Agent Incentives [15.759504531768219]
We describe settings with 'delicate' parts of the state which should not be used as a means to an end. We then train agents to maximize the causal effect of actions on the expected return which is not mediated by the delicate parts of state. The resulting agents have no incentive to control the delicate state.
arXiv Detail & Related papers (2022-04-21T11:01:31Z)
Targeted Attack on Deep RL-based Autonomous Driving with Learned Visual Patterns [18.694795507945603]
Recent studies demonstrated the vulnerability of control policies learned through deep reinforcement learning against adversarial attacks. This paper investigates the feasibility of targeted attacks through visually learned patterns placed on physical object in the environment.
arXiv Detail & Related papers (2021-09-16T04:59:06Z)
Deceptive Decision-Making Under Uncertainty [25.197098169762356]
We study the design of autonomous agents that are capable of deceiving outside observers about their intentions while carrying out tasks. By modeling the agent's behavior as a Markov decision process, we consider a setting where the agent aims to reach one of multiple potential goals. We propose a novel approach to model observer predictions based on the principle of maximum entropy and to efficiently generate deceptive strategies.
arXiv Detail & Related papers (2021-09-14T14:56:23Z)
Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities. We additionally present PUP, a new challenging real-world autonomous driving dataset. We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
arXiv Detail & Related papers (2021-04-26T10:28:34Z)
SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning [24.663586662594703]
Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. Reinforcement learning approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed to accomplish goals that are initially unreachable for the agent.
arXiv Detail & Related papers (2020-12-24T00:31:02Z)
Guided Adversarial Attack for Evaluating and Enhancing Adversarial Defenses [59.58128343334556]
We introduce a relaxation term to the standard loss, that finds more suitable gradient-directions, increases attack efficacy and leads to more efficient adversarial training. We propose Guided Adversarial Margin Attack (GAMA), which utilizes function mapping of the clean image to guide the generation of adversaries. We also propose Guided Adversarial Training (GAT), which achieves state-of-the-art performance amongst single-step defenses.
arXiv Detail & Related papers (2020-11-30T16:39:39Z)
Forethought and Hindsight in Credit Assignment [62.05690959741223]
We work to understand the gains and peculiarities of planning employed as forethought via forward models or as hindsight operating with backward models. We investigate the best use of models in planning, primarily focusing on the selection of states in which predictions should be (re)-evaluated.
arXiv Detail & Related papers (2020-10-26T16:00:47Z)
On the model-based stochastic value gradient for continuous reinforcement learning [50.085645237597056]
We show that simple model-based agents can outperform state-of-the-art model-free agents in terms of both sample-efficiency and final reward. Our findings suggest that model-based policy evaluation deserves closer attention.
arXiv Detail & Related papers (2020-08-28T17:58:29Z)
Online Bayesian Goal Inference for Boundedly-Rational Planning Agents [46.60073262357339]
We present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning. We develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models.
arXiv Detail & Related papers (2020-06-13T01:48:10Z)
Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous [66.6895109554163]
Underlying the human ability to align goals with other agents is their ability to predict the intentions of others and actively update their own plans. We propose hierarchical predictive planning (HPP), a model-based reinforcement learning method for decentralized multiagent rendezvous.
arXiv Detail & Related papers (2020-03-15T19:49:20Z)
Intrinsic Motivation for Encouraging Synergistic Behavior [55.10275467562764]
We study the role of intrinsic motivation as an exploration bias for reinforcement learning in sparse-reward synergistic tasks. Our key idea is that a good guiding principle for intrinsic motivation in synergistic tasks is to take actions which affect the world in ways that would not be achieved if the agents were acting on their own.
arXiv Detail & Related papers (2020-02-12T19:34:51Z)
Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior. As the trained policy learns to be more successful, the negative examples become increasingly similar to expert ones. We propose a method to alleviate the impact of false negatives and test it on the BabyAI environment.
arXiv Detail & Related papers (2020-02-02T14:56:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.