Fugu-MT Paper Translation (Abstract): EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

Paper overview: EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

arxiv url: http://arxiv.org/abs/2604.09535v1
Date: Fri, 10 Apr 2026 17:53:55 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.990793
Title: EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
Title (reference translation): EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
Authors: Lulin Liu, Dayou Li, Yiqing Liang, Sicong Jiang, Hitesh Vijay, Hezhen Hu, Xuhai Xu, Zirui Liu, Srinivas Shakkottai, Manling Li, Zhiwen Fan
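Abstract summary: EgoTL builds a think-aloud capture pipeline for egocentric data. It benchmarks VLMs and World Models on six task dimensions. Foundation models still fall short as egocentric assistants or open-world simulators.
Reference score (independently computed attention score): 45.45270001274317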
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large foundation models have made significant advances in embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations; these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long, daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. It uses a say-before-act protocol to record step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, we are able to benchmark VLMs and World Models on six task dimensions from three layers and long-horizon generation over minute-long sequences across over 100 daily household tasks. We find that foundation models still fall short as egocentric assistants or open-world simulators. Finally, we finetune foundation models with human CoT aligned with metric labels on the training split of EgoTL, which improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.
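Abstract (reference translation): Large foundation models have advanced embodied intelligence, enabling synthesis and reasoning over egocentric input for household tasks. However, VLM-based auto-labeling is often noisy because the primary data sources lack accurate human action labels, chain-of-thought (CoT), and spatial annotations, and these errors are amplified during long-horizon spatial instruction following. These issues stem from insufficient coverage of minute-long daily household planning tasks and from inaccurate spatial grounding. As a result, VLM reasoning chains and world-model synthesis can hallucinate objects, skip steps, or fail to respect real-world physical attributes. To address these gaps, we introduce EgoTL. EgoTL builds a think-aloud capture pipeline for egocentric data. Using a say-before-act protocol, it records step-by-step goals and spoken reasoning with word-level timestamps, then calibrates physical properties with metric-scale spatial estimators, a memory-bank walkthrough for scene context, and clip-level tags for navigation instructions and detailed manipulation actions. With EgoTL, VLMs and World Models can be benchmarked on six task dimensions from three layers, and on long-horizon generation over minute-long sequences across more than 100 daily household tasks. Foundation models still fall short as egocentric assistants or open-world simulators. Finally, finetuning foundation models with human CoT aligned with metric labels on the training split of EgoTL improves long-horizon planning and reasoning, step-wise reasoning, instruction following, and spatial grounding.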
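
Illustrative sketch (not from the paper): the abstract describes a say-before-act protocol in which spoken reasoning with word-level timestamps precedes each tagged action clip. The abstract does not specify any data format, so the `Word` and `Step` types, their field names, and the `attach_speech` rule below are all hypothetical. The sketch only shows one simple way that timestamped words might be grouped with the action clip they precede.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Word:
    """One transcribed word with hypothetical start/end times in seconds."""
    text: str
    start: float
    end: float


@dataclass
class Step:
    """One action clip with a hypothetical clip-level tag and time span."""
    tag: str
    start: float
    end: float
    words: list = field(default_factory=list)


def attach_speech(words, steps):
    """Attach each word to the first step whose action ends after the word starts.

    Assumes `steps` is sorted by time and non-overlapping. Under a
    say-before-act assumption, speech uttered after step i-1 ends and
    before step i ends is treated as the reasoning for step i.
    Words spoken after the last step ends are left unassigned.
    """
    ends = [s.end for s in steps]
    for w in words:
        i = bisect_right(ends, w.start)
        if i < len(steps):
            steps[i].words.append(w.text)
    return steps
```

For example, with a navigation clip spanning 2.0 to 5.0 s and a manipulation clip spanning 7.0 to 9.0 s, words spoken before 5.0 s would be grouped with the navigation step, and words spoken from 5.0 s up to 9.0 s with the manipulation step.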

Related papers