When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
- URL: http://arxiv.org/abs/2602.08995v1
- Date: Mon, 09 Feb 2026 18:41:15 GMT
- Title: When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents
- Authors: Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
- Abstract summary: This work makes the first effort to define and study misaligned action detection in computer-use agents (CUAs). We identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. We propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.
- Score: 50.5814495434565
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.
Related papers
- Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use [6.622648583261088]
Agentic language models must plan, call tools, and execute long-horizon actions where a single misstep can cause irreversible harm. We introduce MOSAIC, a framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. We show that MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance.
arXiv Detail & Related papers (2026-03-03T17:59:35Z)
- Stress Testing Deliberative Alignment for Anti-Scheming Training [39.16405205129775]
Highly capable AI systems could secretly pursue misaligned goals -- what we call "scheming". Measuring and mitigating scheming requires different strategies than are typically used in ML. We use a broad category of "covert actions" -- such as secretly breaking rules or intentionally underperforming in tests -- as a proxy for scheming.
arXiv Detail & Related papers (2025-09-19T02:49:56Z)
- Anomalous Decision Discovery using Inverse Reinforcement Learning [3.3675535571071746]
Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios. We present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection.
arXiv Detail & Related papers (2025-07-06T17:01:02Z)
- Defending against Indirect Prompt Injection by Instruction Detection [109.30156975159561]
InstructDetector is a novel detection-based approach that leverages the behavioral states of LLMs to identify potential IPI attacks. InstructDetector achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, and reduces the attack success rate to just 0.03% on the BIPIA benchmark.
arXiv Detail & Related papers (2025-05-08T13:04:45Z)
- Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%.
arXiv Detail & Related papers (2025-02-13T11:27:05Z)
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- Preemptive Detection and Correction of Misaligned Actions in LLM Agents [58.39520480675366]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Macro-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action localization (WTAL) aims to classify and localize temporal boundaries of actions for the video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Moving Forward by Moving Backward: Embedding Action Impact over Action Semantics [57.671493865825255]
We propose to model the impact of actions on-the-fly using latent embeddings.
By combining these latent action embeddings with a novel, transformer-based, policy head, we design an Action Adaptive Policy.
We show that our AAP is highly performant even when faced, at inference time, with missing actions and a previously unseen, perturbed action space.
arXiv Detail & Related papers (2023-04-24T17:35:47Z)
- Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization [15.945378631406024]
Reinforcement learning (RL) has demonstrated impressive performance in decision-making tasks like embodied control, autonomous driving and financial trading.
In many decision-making tasks, the agents often encounter the problem of executing actions under limited budgets.
This paper formalizes the problem as a Sparse Action Markov Decision Process (SA-MDP), in which specific actions in the action space can only be executed for a limited time.
We propose a policy optimization algorithm, Action Sparsity REgularization (ASRE), which adaptively handles each action with a distinct preference.
arXiv Detail & Related papers (2021-05-18T16:50:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.