Fugu-MT Paper Translation (Abstract): EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

Paper overview: EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

arxiv url: http://arxiv.org/abs/2510.23569v1
Date: Mon, 27 Oct 2025 17:38:17 GMT
Status: Translation complete
Last updated in system: 2025-10-28 15:28:15.646816
Title: EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT
Title (reference translation): EgoThinker: Egocentric Reasoning with Spatio-Temporal CoT
Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Yu Qiao, Jiangmiao Pang
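Abstract summary: EgoThinker is a framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. EgoThinker outperforms existing methods on multiple egocentric benchmarks and achieves substantial improvements on fine-grained spatio-temporal localization tasks.
Reference score (independently calculated attention score): 56.24624833924252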
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Egocentric video reasoning centers on an unobservable agent behind the camera who dynamically shapes the environment, requiring inference of hidden intentions and recognition of fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at visible event reasoning but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a novel framework that endows MLLMs with robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset constructed from 13M diverse egocentric video clips. This dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we employ SFT on EgoRe-5M to instill reasoning skills, followed by reinforcement fine-tuning (RFT) to further enhance spatio-temporal localization. Experimental results show that EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained spatio-temporal localization tasks. Full code and data are released at https://github.com/InternRobotics/EgoThinker.
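Abstract (reference translation): Egocentric video reasoning focuses on an unobservable agent behind the camera who dynamically shapes the environment, and it requires inferring hidden intentions and recognizing fine-grained interactions. This core challenge limits current multimodal large language models (MLLMs), which excel at reasoning about visible events but lack embodied, first-person understanding. To bridge this gap, we introduce EgoThinker, a new framework that gives MLLMs robust egocentric reasoning capabilities through spatio-temporal chain-of-thought supervision and a two-stage learning curriculum. First, we introduce EgoRe-5M, a large-scale egocentric QA dataset built from 13 million diverse egocentric video clips. The dataset features multi-minute segments annotated with detailed CoT rationales and dense hand-object grounding. Second, we apply SFT on EgoRe-5M to instill reasoning skills, then use reinforcement fine-tuning (RFT) to further improve spatio-temporal localization. Experiments show that EgoThinker outperforms existing methods on multiple egocentric benchmarks and achieves substantial improvements on fine-grained spatio-temporal localization tasks. Full code and data are available at https://github.com/InternRobotics/EgoThinker.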


Related papers