EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
- URL: http://arxiv.org/abs/2510.23569v1
- Date: Mon, 27 Oct 2025 17:38:17 GMT
- Title: EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang
- Abstract summary: EgoThinker is a framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks.
- Score: 56.24624833924252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ supervised fine-tuning (SFT) on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
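The two-stage recipe described in the abstract (SFT on CoT-annotated QA, then reward-based fine-tuning against a spatio-temporal localization signal) can be pictured with a minimal PyTorch-style sketch. This is an illustrative assumption of how such a curriculum is typically wired, not the released EgoThinker code: the model interface (`model.sample`), the `parse_span` helper, and the IoU reward shaping are hypothetical placeholders.

```python
# Minimal sketch of a two-stage curriculum: (1) SFT on CoT rationales and answers,
# (2) REINFORCE-style fine-tuning with a temporal-IoU reward for localization.
# All names below (model.sample, parse_span, batch fields) are hypothetical.
import torch
import torch.nn.functional as F

def sft_step(model, optimizer, batch):
    """Stage 1: supervised fine-tuning on prompt + CoT rationale + answer tokens."""
    # batch["labels"] mirrors batch["input_ids"] with prompt positions masked to -100
    logits = model(batch["input_ids"]).logits
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def temporal_iou_reward(pred_span, gt_span):
    """Reward = IoU between a predicted (start, end) interval and the ground truth."""
    inter = max(0.0, min(pred_span[1], gt_span[1]) - max(pred_span[0], gt_span[0]))
    union = max(pred_span[1], gt_span[1]) - min(pred_span[0], gt_span[0])
    return inter / union if union > 0 else 0.0

def rft_step(model, optimizer, batch, parse_span):
    """Stage 2: sample localization answers and reinforce high-IoU completions."""
    samples, log_probs = model.sample(batch["input_ids"])  # hypothetical sampling API
    rewards = torch.tensor([
        temporal_iou_reward(parse_span(s), gt)
        for s, gt in zip(samples, batch["gt_spans"])
    ])
    advantages = rewards - rewards.mean()  # simple mean baseline
    loss = -(advantages * log_probs.sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```

The split mirrors the abstract's motivation: the SFT stage teaches the model to produce grounded rationales, while the reward stage directly optimizes the spatio-temporal localization metric that token-level cross-entropy only supervises indirectly.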
Related papers
- Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention [58.05340906967343]
Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. Existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets. We introduce CERES, a plug-in causal framework that adapts strong, pre-trained RVOS models to the egocentric domain.
arXiv Detail & Related papers (2025-12-30T16:22:14Z) - EgoCross: Benchmarking Multimodal Large Language Models for Cross-Domain Egocentric Video Question Answering [59.94048858464922]
EgoCross is a benchmark to evaluate the cross-domain generalization of MLLMs in EgocentricQA. EgoCross covers four diverse and challenging domains, including surgery, industry, extreme sports, and animal perspective. It comprises approximately 1,000 QA pairs across 798 video clips, spanning four key QA tasks: prediction, recognition, localization, and counting.
arXiv Detail & Related papers (2025-08-14T15:11:20Z) - Fine-grained Spatiotemporal Grounding on Egocentric Videos [13.319346673043286]
We introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. EgoMask is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks. We also create EgoMask-Train, a large-scale training dataset to facilitate model development.
arXiv Detail & Related papers (2025-08-01T10:53:27Z) - EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges. EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z) - EgoToM: Benchmarking Theory of Mind Reasoning from Egocentric Videos [26.930652137352197]
We introduce EgoToM, a new video question-answering benchmark that extends Theory-of-Mind evaluation to egocentric domains. Using a causal ToM model, we generate multi-choice video QA instances for the Ego4D dataset to benchmark the ability to predict a camera wearer's goals, beliefs, and next actions. We study the performance of both humans and state-of-the-art multimodal large language models (MLLMs) on these three interconnected inference problems.
arXiv Detail & Related papers (2025-03-28T05:10:59Z) - Omnia de EgoTempo: Benchmarking Temporal Understanding of Multi-Modal LLMs in Egocentric Videos [51.8995932557911]
EgoTempo is a dataset designed to evaluate temporal understanding in the egocentric domain. We show that state-of-the-art Multi-Modal Large Language Models (MLLMs) achieve remarkably high performance on existing benchmarks using just text or a single frame as input. We hope EgoTempo will catalyze new research in the field and inspire models that better capture the complexity of temporal dynamics.
arXiv Detail & Related papers (2025-03-17T18:50:36Z) - Exo2Ego: Exocentric Knowledge Guided MLLM for Egocentric Video Understanding [69.96199605596138]
Current MLLMs primarily focus on third-person (exocentric) vision, overlooking the unique aspects of first-person (egocentric) videos. We propose learning the mapping between exocentric and egocentric domains to enhance egocentric video understanding. We introduce Ego-ExoClip, a pre-training dataset comprising 1.1M synchronized ego-exo clip-text pairs.
arXiv Detail & Related papers (2025-03-12T08:10:33Z) - Do Egocentric Video-Language Models Truly Understand Hand-Object Interactions? [48.702973928321946]
Egocentric video-language pretraining is a crucial step in advancing the understanding of hand-object interactions in first-person scenarios. Despite successes on existing testbeds, we find that current EgoVLMs can be easily misled by simple modifications. This raises the question: Do EgoVLMs truly understand hand-object interactions?
arXiv Detail & Related papers (2024-05-28T00:27:29Z)