Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
- URL: http://arxiv.org/abs/2508.01742v1
- Date: Sun, 03 Aug 2025 12:52:27 GMT
- Title: Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation
- Authors: Qiaohui Chu, Haoyu Zhang, Meng Liu, Yisen Feng, Haoxiang Shi, Liqiang Nie,
- Abstract summary: INSIGHT is a two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning.
- Score: 52.6091162517921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) -> intention inference (reason) -> action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
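The stage-one enhancement described above hinges on a verb-noun co-occurrence matrix. As a rough illustration only, the sketch below shows one plausible way such a prior could re-score joint verb-noun action predictions; the function names, the additive smoothing, and the `alpha` weighting are assumptions made for this example, not the authors' implementation.

```python
# Hypothetical sketch (not the INSIGHT code): re-scoring joint verb-noun
# predictions with a verb-noun co-occurrence prior.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def build_cooccurrence(pairs, num_verbs, num_nouns, smoothing=1.0):
    # Count verb-noun pairs from training annotations and normalize each
    # verb row into a smoothed prior P(noun | verb).
    counts = np.full((num_verbs, num_nouns), smoothing)
    for v, n in pairs:
        counts[v, n] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

def rescore_actions(verb_logits, noun_logits, cooc, alpha=0.5):
    # Combine independent verb/noun predictions into a joint action score
    # and down-weight pairs the co-occurrence prior deems implausible.
    joint = np.outer(softmax(verb_logits), softmax(noun_logits))
    return joint * (cooc ** alpha)  # alpha controls the prior's influence

if __name__ == "__main__":
    # Toy vocabulary: verbs {0: take, 1: cut, 2: wash}, nouns {0: knife, 1: onion, 2: pan}.
    pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (2, 2)]
    cooc = build_cooccurrence(pairs, num_verbs=3, num_nouns=3)
    scores = rescore_actions(np.array([0.2, 1.5, 0.1]),
                             np.array([0.3, 1.2, 0.4]),
                             cooc)
    v, n = np.unravel_index(scores.argmax(), scores.shape)
    print(f"most plausible next action: verb={v}, noun={n}")
```

In a real setting the matrix would be estimated from a dataset's verb-noun action annotations (e.g., Ego4D or EPIC-Kitchens labels) rather than hand-built as in this toy example.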
Related papers
- EgoPrompt: Prompt Learning for Egocentric Action Recognition [49.12318087940015]
EgoPrompt is a prompt learning-based framework for egocentric action recognition. EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.
arXiv Detail & Related papers (2025-08-05T09:47:07Z) - Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z) - DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning [11.242852367476015]
DeepEyes is a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning. We propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T13:48:11Z) - Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning [27.511627003202538]
Traditional scene graphs primarily focus on spatial relationships, limiting the ability of vision-language models (VLMs) to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components.
arXiv Detail & Related papers (2025-05-14T04:04:23Z) - Navigating the State of Cognitive Flow: Context-Aware AI Interventions for Effective Reasoning Support [6.758533259752144]
Flow theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation. In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making. This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on type, timing, and scale.
arXiv Detail & Related papers (2025-04-22T16:35:39Z) - A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs [3.2228025627337864]
This paper introduces a structured evaluation framework to dissect the perception-reasoning interface in Vision-Language Models (VLMs). We propose three distinct evaluation paradigms, mirroring human problem-solving strategies. Applying this framework, we demonstrate that CA, leveraging powerful language models for reasoning over rich, independently generated descriptions, achieves new state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-01-23T12:42:42Z) - Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation [124.07372905781696]
Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts for facilitating the alignment.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
arXiv Detail & Related papers (2023-02-13T03:08:05Z) - Learning Action-Effect Dynamics for Hypothetical Vision-Language
Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z) - ReAct: Synergizing Reasoning and Acting in Language Models [44.746116256516046]
We show that large language models (LLMs) can generate both reasoning traces and task-specific actions in an interleaved manner.
We apply our approach, named ReAct, to a diverse set of language and decision making tasks.
ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API.
arXiv Detail & Related papers (2022-10-06T01:00:32Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms the existing state of the art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.