Preemptive Detection and Correction of Misaligned Actions in LLM Agents
- URL: http://arxiv.org/abs/2407.11843v3
- Date: Fri, 27 Dec 2024 14:17:05 GMT
- Title: Preemptive Detection and Correction of Misaligned Actions in LLM Agents
- Authors: Haishuo Fang, Xiaodan Zhu, Iryna Gurevych
- Abstract summary: InferAct is a novel approach to detect misaligned actions before execution.
It alerts users for timely correction, preventing adverse outcomes.
InferAct achieves up to 20% improvement in Macro-F1 over baselines in misaligned action detection.
- Score: 70.54226917774933
- License:
- Abstract: Deploying LLM-based agents in real-life applications often faces a critical challenge: the misalignment between agents' behavior and user intent. Such misalignment may lead agents to unintentionally execute critical actions that carry negative outcomes (e.g., accidentally triggering a "buy-now" in web shopping), resulting in undesirable or even irreversible consequences. Although addressing these issues is crucial, the preemptive detection and correction of misaligned actions remains relatively underexplored. To fill this gap, we introduce InferAct, a novel approach that leverages the belief reasoning ability of LLMs, grounded in Theory-of-Mind, to detect misaligned actions before execution. Once misalignment is detected, InferAct alerts users for timely correction, preventing adverse outcomes and enhancing the reliability of LLM agents' decision-making processes. Experiments on three widely used tasks demonstrate that InferAct achieves up to 20% improvement in Macro-F1 over baselines in misaligned action detection. An in-depth evaluation of misalignment correction further highlights InferAct's effectiveness in improving agent alignment.
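The detect-then-defer pattern described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration and not InferAct's actual interface: the helper names (`infer_alignment`, `execute_with_guard`), the critical-action list, and the 0.5 threshold are assumptions. Before a critical action runs, an evaluator LLM reasons about the user's likely intent from the interaction so far and, if the action looks misaligned, the agent defers to the user instead of executing it.

```python
# Minimal sketch of preemptive misalignment checking before a critical action.
# All names here are hypothetical; InferAct's actual components may differ.
from dataclasses import dataclass

CRITICAL_ACTIONS = {"buy_now", "delete_file", "send_payment"}

@dataclass
class Action:
    name: str
    arguments: dict

def infer_alignment(llm, task: str, trajectory: list[str], action: Action) -> float:
    """Ask an evaluator LLM to reason about the user's intent (belief reasoning)
    and return a probability that the proposed action matches that intent."""
    prompt = (
        f"Task: {task}\n"
        f"Trajectory so far: {trajectory}\n"
        f"Proposed action: {action.name}({action.arguments})\n"
        "Does this action match what the user actually wants? "
        "Answer with a probability between 0 and 1."
    )
    return float(llm(prompt).strip())

def execute_with_guard(llm, env, task: str, trajectory: list[str],
                       action: Action, threshold: float = 0.5):
    """Only execute critical actions the evaluator judges aligned;
    otherwise defer to the user for timely correction."""
    if action.name in CRITICAL_ACTIONS:
        if infer_alignment(llm, task, trajectory, action) < threshold:
            return {"status": "deferred", "reason": "possible misalignment; ask the user"}
    return env.step(action)
```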
Related papers
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.
This poses a significant challenge to ensuring their safe deployment.
We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
- PSSD: Making Large Language Models Self-denial via Human Psyche Structure [5.057375783924452]
We present PSSD, which draws on and implements the human psyche structure, in which three distinct and interconnected roles contribute to reasoning.
Extensive experiments demonstrate that the proposed design not only better enhances reasoning capabilities but also integrates seamlessly with current models.
arXiv Detail & Related papers (2025-02-03T13:37:21Z)
- Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling [9.305763502526833]
We propose an accountability model for task-oriented dialogue agents to address user overreliance via friction turns.
Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions.
arXiv Detail & Related papers (2025-01-17T17:40:12Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
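Under assumed notation (not necessarily the paper's exact formulation), this definition of true criticality can be written as the gap between the policy's value and the expected return when the next n actions are sampled uniformly at random before control returns to the policy:

```latex
% Illustrative definition (assumed notation): true criticality of state s under policy \pi
C_n(s) \;=\; V^{\pi}(s) \;-\;
\mathbb{E}\!\left[ G_t \;\middle|\; s_t = s,\;
a_t, \dots, a_{t+n-1} \sim \mathrm{Uniform}(\mathcal{A}),\ \text{then follow } \pi \right]
```

Here G_t denotes the return from time t; larger C_n(s) means random deviations at s are more costly, i.e., the state is more critical.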
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- Evaluating Uncertainty-based Failure Detection for Closed-Loop LLM Planners [10.746821861109176]
Large Language Models (LLMs) have demonstrated remarkable performance as zero-shot task planners for robotic tasks.
However, the open-loop nature of previous works makes LLM-based planning error-prone and fragile.
In this work, we introduce KnowLoop, a framework for closed-loop LLM-based planning backed by an uncertainty-based failure detector built on multimodal LLMs (MLLMs).
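The closed-loop, uncertainty-gated pattern might look roughly like the sketch below; the component names (`planner_llm`, `detector`, `env`) and the threshold are illustrative assumptions rather than KnowLoop's actual API.

```python
# Minimal sketch of closed-loop LLM planning gated by an uncertainty-based
# failure check. Names and the threshold are illustrative assumptions.
def closed_loop_plan(planner_llm, detector, env, goal: str,
                     max_steps: int = 20, uncertainty_threshold: float = 0.4) -> bool:
    observation = env.reset()
    for _ in range(max_steps):
        step = planner_llm.next_step(goal, observation)   # propose the next action/subgoal
        observation, done = env.execute(step)
        # The detector (e.g., a multimodal LLM judging the outcome) reports how
        # uncertain it is that the step succeeded; high uncertainty triggers replanning.
        uncertainty = detector.estimate_uncertainty(goal, step, observation)
        if uncertainty > uncertainty_threshold:
            planner_llm.add_feedback(f"Step '{step}' likely failed; replan.")
            continue
        if done:
            return True
    return False
```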
arXiv Detail & Related papers (2024-06-01T12:52:06Z)
- Reconfidencing LLMs from the Grouping Loss Perspective [56.801251926946485]
Large Language Models (LLMs) are susceptible to generating hallucinated answers in a confident tone.
Recent findings show that controlling uncertainty must go beyond calibration.
We construct a new evaluation dataset derived from a knowledge base to assess the confidence scores assigned to answers from Mistral and LLaMA.
arXiv Detail & Related papers (2024-02-07T15:40:22Z)
- SMARLA: A Safety Monitoring Approach for Deep Reinforcement Learning Agents [7.33319373357049]
This paper introduces SMARLA, a black-box safety monitoring approach specifically designed for Deep Reinforcement Learning (DRL) agents.
SMARLA utilizes machine learning to predict safety violations by observing the agent's behavior during execution.
Empirical results reveal that SMARLA accurately predicts safety violations with a low false positive rate, and can predict violations early, approximately halfway through the agent's execution, before they occur.
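In spirit, such a monitor can be sketched as a lightweight predictor that watches the episode so far and raises an alarm once the estimated probability of a violation crosses a threshold; the class and attribute names below are assumptions for illustration, not SMARLA's implementation.

```python
# Minimal sketch of a black-box safety monitor in the spirit of SMARLA.
# The state-abstraction function and the violation predictor are assumptions.
class SafetyMonitor:
    def __init__(self, predict_violation, abstract_state, threshold: float = 0.8):
        self.predict_violation = predict_violation  # e.g., a model trained on past episodes
        self.abstract_state = abstract_state        # maps raw observations to abstract features
        self.threshold = threshold

    def check(self, episode_prefix) -> bool:
        """Return True (raise an alarm) when a safety violation is predicted
        from the agent's behavior observed so far in the episode."""
        features = [self.abstract_state(obs) for obs in episode_prefix]
        p_violation = self.predict_violation(features)  # estimated probability of a violation
        return p_violation >= self.threshold

# Illustrative usage during execution: query the monitor every step.
# monitor = SafetyMonitor(predict_violation=my_model, abstract_state=my_abstraction)
# if monitor.check(observations_so_far):
#     trigger_safety_fallback()
```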
arXiv Detail & Related papers (2023-08-03T21:08:51Z)
- Moving Forward by Moving Backward: Embedding Action Impact over Action Semantics [57.671493865825255]
We propose to model the impact of actions on-the-fly using latent embeddings.
By combining these latent action embeddings with a novel transformer-based policy head, we design an Action Adaptive Policy (AAP).
We show that our AAP is highly performant even when faced at inference time with missing actions and previously unseen, perturbed action spaces.
arXiv Detail & Related papers (2023-04-24T17:35:47Z)
- Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process (POMDP).
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
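One illustrative way to express this trade-off (assumed notation, not the paper's exact formalization) is an objective that augments the proxy return with the value the agent's final state preserves for a distribution of possible future tasks:

```latex
% Illustrative objective (assumed notation): proxy return plus preserved option value
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T-1} r_{\mathrm{proxy}}(s_t, a_t) \right]
\;+\; \lambda\, \mathbb{E}_{\pi}\, \mathbb{E}_{R' \sim \mathcal{D}}\!\left[ V^{*}_{R'}(s_T) \right]
```

The second term measures how well the agent could still pursue tasks drawn from a distribution D after its trajectory ends, which penalizes side effects that destroy option value.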
arXiv Detail & Related papers (2022-06-23T16:36:13Z)
- How RL Agents Behave When Their Actions Are Modified [0.0]
Reinforcement learning in complex environments may require supervision to prevent the agent from attempting dangerous actions.
We present the Modified-Action Markov Decision Process (MAMDP), an extension of the MDP model in which the executed actions can differ from those selected by the policy.
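Informally, and with assumed notation rather than the paper's exact definition, the extension augments the MDP tuple with a modification function E that maps the policy's chosen action distribution at a state to the distribution of actions actually executed:

```latex
% Illustrative sketch (assumed notation) of a Modified-Action MDP
\mathcal{M} \;=\; \langle \mathcal{S}, \mathcal{A}, T, R, \gamma, E \rangle,
\qquad a_t \sim E\big(\cdot \mid \pi(\cdot \mid s_t),\, s_t\big)
```

Here E captures interventions such as a supervisor overriding actions, so the executed action a_t need not match the policy's choice.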
arXiv Detail & Related papers (2021-02-15T18:10:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.