Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
- URL: http://arxiv.org/abs/2508.01742v1
- Date: Sun, 03 Aug 2025 12:52:27 GMT
- Title: Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
- Authors: Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie
- Abstract summary: INSIGHT is a two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
- Score: 52.6091162517921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
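The abstract mentions a verb-noun co-occurrence matrix but does not say how it is built or applied. The following is a minimal sketch, not the authors' implementation: it assumes the matrix is counted from annotated (verb, noun) pairs and then used as a prior to rescore noun predictions. The function names, the add-one smoothing, and the multiplicative rescoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def cooccurrence_matrix(pairs, num_verbs, num_nouns):
    """Count how often each (verb, noun) index pair occurs in the annotations."""
    counts = np.zeros((num_verbs, num_nouns), dtype=np.float64)
    for verb, noun in pairs:
        counts[verb, noun] += 1.0
    return counts


def noun_given_verb(counts, smoothing=1.0):
    """Row-normalize with additive smoothing to estimate P(noun | verb)."""
    smoothed = counts + smoothing
    return smoothed / smoothed.sum(axis=1, keepdims=True)


def rescore_nouns(verb_probs, noun_probs, cond):
    """Reweight noun probabilities by the prior implied by the verb distribution.

    This multiplicative rule is an assumption for illustration only.
    """
    prior = verb_probs @ cond          # expected P(noun) under the predicted verbs
    scores = noun_probs * prior
    return scores / scores.sum()


# Toy annotations: verb 0 mostly pairs with noun 0, verb 1 with noun 2.
pairs = [(0, 0), (0, 0), (0, 1), (1, 2)]
counts = cooccurrence_matrix(pairs, num_verbs=2, num_nouns=3)
cond = noun_given_verb(counts)

verb_probs = np.array([0.9, 0.1])      # hypothetical model output over verbs
noun_probs = np.full(3, 1.0 / 3.0)     # uninformative noun prediction
rescored = rescore_nouns(verb_probs, noun_probs, cond)
print(rescored)
```

With an uninformative noun prediction, the rescored distribution simply follows the co-occurrence prior, so nouns that frequently accompany the likely verb move to the top.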
Related papers
- ViThinker: Active Vision-Language Reasoning via Dynamic Perceptual Querying [15.728211622542267]
ViThinker is a framework that enables vision-language models to autonomously generate decision tokens triggering the synthesis of expert-aligned visual features on demand.
ViThinker internalizes vision-expert capabilities during training, performing generative mental simulation during inference without external tool calls.
arXiv Detail & Related papers (2026-02-02T22:29:57Z)
- Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution.
RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z)
- Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention [58.05340906967343]
Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos.
Existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets.
We introduce Causal-REferring (CERES), a plug-in causal framework that adapts strong, pre-trained RVOSs to the egocentric domain.
arXiv Detail & Related papers (2025-12-30T16:22:14Z)
- Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence.
CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities.
Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z)
- Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools [41.993750134878766]
Video-STAR is a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition.
Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching.
Our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transitioning from text-centric reasoning to visually grounded inference.
arXiv Detail & Related papers (2025-10-09T17:20:44Z)
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation [69.79110149746506]
We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to autonomously generate precise and contextually coherent navigation instructions.
GoViG exclusively leverages raw egocentric visual data, substantially improving its adaptability to unseen and unstructured environments.
arXiv Detail & Related papers (2025-08-13T07:05:17Z)
- EgoPrompt: Prompt Learning for Egocentric Action Recognition [49.12318087940015]
EgoPrompt is a prompt learning-based framework for the egocentric action recognition task.
EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.
arXiv Detail & Related papers (2025-08-05T09:47:07Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space.
Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning.
We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories.
DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T13:48:11Z)
- Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning [27.511627003202538]
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes.
This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes.
We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components.
arXiv Detail & Related papers (2025-05-14T04:04:23Z)
- Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support [6.758533259752144]
Flow theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation.
In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making.
This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on type, timing, and scale.
arXiv Detail & Related papers (2025-04-22T16:35:39Z)
- A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs).
We propose three distinct evaluation paradigms, mirroring human problem-solving strategies.
Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z)
- Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation [124.07372905781696]
Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts for facilitating the alignment.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
arXiv Detail & Related papers (2023-02-13T03:08:05Z)
- Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z)
- ReAct: Synergizing Reasoning and Acting in Language Models [44.746116256516046]
We show that large language models (LLMs) can generate both reasoning traces and task-specific actions in an interleaved manner.
We apply our approach, named ReAct, to a diverse set of language and decision making tasks.
ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API.
arXiv Detail & Related papers (2022-10-06T01:00:32Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.