Fugu-MT Paper Translation (Abstract): EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

Paper overview: EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

arxiv url: http://arxiv.org/abs/2603.12147v1
Date: Thu, 12 Mar 2026 16:46:01 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.227908
Title: EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next
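Title (reference translation): EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and What Comes Next
Authors: Ye Pan, Chi Kit Wong, Yuanhuiyi Lyu, Hanqian Li, Jiahao Huo, Jiacheng Chen, Lutao Jiang, Xu Zheng, Xuming Hu
Abstract summary: EgoIntent is a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios. Each clip is truncated immediately before the key outcome of the queried step occurs.
Reference score (independently computed attention score): 52.87513180819888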
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance require understanding not only what a person is doing at each step, but also why and what comes next, in order to provide timely and context-aware support. To this end, we introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios, and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Crucially, each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, preventing future-frame leakage and enabling a clean evaluation of anticipatory step understanding and next-step planning. We evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, underscoring that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.
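Abstract (reference translation): Multimodal Large Language Models (MLLMs) have shown remarkable video reasoning capabilities across a wide range of tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus mainly on episode-level intent reasoning and overlook the finer granularity of step-level intent understanding. Yet applications such as intelligent assistants, robotic imitation learning, and augmented reality guidance must understand not only what a person is doing at each step, but also why they are doing it and what comes next, in order to provide timely, context-aware support. To this end, the authors introduce EgoIntent, a step-level intent understanding benchmark for egocentric videos. It comprises 3,014 steps spanning 15 diverse indoor and outdoor daily-life scenarios and evaluates models on three complementary dimensions: local intent (What), global intent (Why), and next-step plan (Next). Each clip is truncated immediately before the key outcome of the queried step (e.g., contact or grasp) occurs and contains no frames from subsequent steps, which prevents future-frame leakage and enables a clean evaluation of anticipatory step understanding and next-step planning. The authors evaluate 15 MLLMs, including both state-of-the-art closed-source and open-source models. Even the best-performing model achieves an average score of only 33.31 across the three intent dimensions, indicating that step-level intent understanding in egocentric videos remains a highly challenging problem that calls for further investigation.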
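The truncation rule and the three-dimension average described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the frame-index representation of a clip, and the use of an unweighted mean are all assumptions made for illustration; the abstract only states that clips end immediately before the key outcome and that scores are averaged over What, Why, and Next.

```python
from statistics import mean


def truncate_clip(frames: list, key_outcome_frame: int) -> list:
    """Keep only the frames strictly before the key outcome (e.g., contact or grasp).

    `key_outcome_frame` is a hypothetical annotation: the index of the first
    frame in which the outcome of the queried step is visible.
    """
    if not 0 <= key_outcome_frame <= len(frames):
        raise ValueError("key_outcome_frame falls outside the clip")
    # Slicing excludes the outcome frame and everything after it,
    # so no future frames leak into the model input.
    return frames[:key_outcome_frame]


def average_intent_score(what: float, why: float, next_step: float) -> float:
    """Unweighted mean over the three intent dimensions (assumed aggregation)."""
    return mean([what, why, next_step])


if __name__ == "__main__":
    clip = list(range(10))          # stand-in for 10 decoded video frames
    print(truncate_clip(clip, 6))   # frames 0 through 5 are kept
    print(average_intent_score(30.0, 33.0, 36.0))
```

Treating the cut point as an exclusive slice bound keeps the outcome frame itself out of the clip, which matches the "immediately before" wording in the abstract.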
