Procedural Mistake Detection via Action Effect Modeling
- URL: http://arxiv.org/abs/2512.03474v1
- Date: Wed, 03 Dec 2025 05:56:17 GMT
- Title: Procedural Mistake Detection via Action Effect Modeling
- Authors: Wenliang Guo, Yujiang Pu, Yu Kong
- Abstract summary: Action Effect Modeling (AEM) is a unified framework that captures action execution and its outcomes through a probabilistic formulation. AEM identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations.
- Score: 10.358293338390716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
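The effect-frame selection and prompt-alignment steps described in the abstract are concrete enough to sketch. The following is a minimal, hypothetical illustration, not the authors' implementation: semantic relevance is stood in for by cosine similarity of precomputed (e.g., CLIP-style) embeddings, visual quality by Laplacian-variance sharpness, and `alpha` and `threshold` are assumed hyperparameters that do not appear in the paper.

```python
import numpy as np

def semantic_relevance(frame_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between a frame embedding and a text embedding,
    a stand-in for the paper's semantic-relevance term."""
    denom = np.linalg.norm(frame_emb) * np.linalg.norm(text_emb) + 1e-8
    return float(frame_emb @ text_emb / denom)

def visual_quality(frame: np.ndarray) -> float:
    """Sharpness proxy: variance of a discrete Laplacian over a grayscale
    frame. A hypothetical stand-in for the paper's visual-quality term."""
    lap = (np.roll(frame, 1, 0) + np.roll(frame, -1, 0) +
           np.roll(frame, 1, 1) + np.roll(frame, -1, 1) - 4 * frame)
    return float(lap.var())

def select_effect_frame(frames, frame_embs, action_emb, alpha=0.7):
    """Score candidate frames by a convex combination of semantic relevance
    and min-max-normalized visual quality; return the argmax index."""
    q = np.array([visual_quality(f) for f in frames])
    q = (q - q.min()) / (q.max() - q.min() + 1e-8)
    scores = [alpha * semantic_relevance(e, action_emb) + (1 - alpha) * qi
              for e, qi in zip(frame_embs, q)]
    return int(np.argmax(scores))

def detect_mistake(effect_emb, prompt_emb, threshold=0.5):
    """OCC-style check: flag a mistake when the selected effect frame's
    representation drifts from the intended-action prompt embedding."""
    return semantic_relevance(effect_emb, prompt_emb) < threshold

# Toy usage with random stand-ins for real frames and CLIP-style embeddings.
rng = np.random.default_rng(0)
frames = [rng.random((64, 64)) for _ in range(8)]          # grayscale frames
frame_embs = [rng.standard_normal(512) for _ in range(8)]  # per-frame embeddings
action_emb = rng.standard_normal(512)                      # action-text embedding
idx = select_effect_frame(frames, frame_embs, action_emb)
print("effect frame:", idx, "mistake:", detect_mistake(frame_embs[idx], action_emb))
```

In the paper the two cues are fused into a learned effect-aware representation; the fixed convex combination above is just the simplest scoring rule consistent with the abstract's description.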
Related papers
- PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
The PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z)
- Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%.
arXiv Detail & Related papers (2025-02-13T11:27:05Z)
- Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection [4.938957922033169]
Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts. We propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN). We show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.
arXiv Detail & Related papers (2024-09-16T02:53:49Z)
- Efficient Human-Object-Interaction (EHOI) Detection via Interaction Label Coding and Conditional Decision [33.59153869330463]
An Efficient HOI (EHOI) detector is proposed in this work to strike a good balance between detection performance, inference complexity, and mathematical transparency.
Our contributions include the application of error correction codes (ECCs) to encode rare interaction cases (see the Hamming-code sketch after this list).
Experimental results demonstrate the advantages of ECC-coded interaction labels and the excellent balance of detection performance and complexity of the proposed EHOI method.
arXiv Detail & Related papers (2024-08-13T16:34:06Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Counterfactual Reasoning for Multi-Label Image Classification via Patching-Based Training [84.95281245784348]
Overemphasizing co-occurrence relationships can cause the model to overfit.
We provide a causal inference framework to show that the correlative features caused by the target object and its co-occurring objects can be regarded as a mediator.
arXiv Detail & Related papers (2024-04-09T13:13:24Z)
- Visual Imitation Learning with Calibrated Contrastive Representation [44.63125396964309]
Adversarial Imitation Learning (AIL) allows the agent to reproduce expert behavior with low-dimensional states and actions.
This paper proposes a simple and effective solution by incorporating contrastive representation learning into the visual AIL framework (see the contrastive-loss sketch after this list).
arXiv Detail & Related papers (2024-01-21T04:18:30Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z)
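As flagged in the EHOI entry above, here is a minimal illustration of how error correction codes can make categorical labels robust to single prediction errors. It uses a textbook Hamming(7,4) code over a 4-bit label; the code choice and label layout are assumptions for illustration, not the EHOI paper's actual design.

```python
import numpy as np

# Parity-check matrix for Hamming(7,4); column i is the binary
# representation of position i+1, so the syndrome directly gives the
# 1-indexed position of a single corrupted bit.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def hamming74_encode(data4):
    """Encode 4 data bits into a 7-bit codeword (data at positions
    3, 5, 6, 7; parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data4
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return np.array([p1, p2, d1, p3, d2, d3, d4])

def hamming74_decode(codeword):
    """Correct up to one flipped bit via the syndrome, then read the
    data bits back out of the corrected codeword."""
    syndrome = H @ codeword % 2
    pos = syndrome[0] + 2 * syndrome[1] + 4 * syndrome[2]  # 1-indexed
    fixed = codeword.copy()
    if pos:
        fixed[pos - 1] ^= 1
    return fixed[[2, 4, 5, 6]]

# Toy usage: treat a rare interaction label as 4 bits, corrupt one bit
# (e.g., a noisy prediction), and recover the original label.
label = np.array([1, 0, 1, 1])   # hypothetical interaction label
word = hamming74_encode(label)
word[5] ^= 1                     # single-bit prediction error
assert (hamming74_decode(word) == label).all()
print("recovered label:", hamming74_decode(word))
```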
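The visual imitation learning entry above mentions contrastive representation learning. The sketch below is the generic InfoNCE objective in NumPy, assuming paired embeddings where matching rows are positives and all other rows serve as negatives; it is not the paper's calibrated variant, and the `temperature` value is an assumption.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE loss over L2-normalized embeddings: each anchor's
    positive is the matching row of `positives`; the remaining rows act
    as negatives. Matching pairs sit on the diagonal of the logit matrix."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: anchors and slightly perturbed positive views.
rng = np.random.default_rng(0)
z_a = rng.standard_normal((16, 128))                # e.g., agent observations
z_p = z_a + 0.05 * rng.standard_normal((16, 128))   # augmented/expert views
print("InfoNCE loss:", info_nce(z_a, z_p))
```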