Fugu-MT Paper Translation (Abstract): Ego-Grounding for Personalized Question-Answering in Egocentric Videos

Paper overview: Ego-Grounding for Personalized Question-Answering in Egocentric Videos

arxiv url: http://arxiv.org/abs/2604.01966v1
Date: Thu, 02 Apr 2026 12:29:23 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.781134
Title: Ego-Grounding for Personalized Question-Answering in Egocentric Videos
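Title (reference translation): Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Authors: Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao
Abstract summary: This paper presents the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering that requires ego-grounding. MyEgo is the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer.
Reference score (independently calculated attention score): 54.479709790133946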
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
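Abstract (reference translation): This paper presents the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding. To this end, it introduces MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo consists of 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking shows that MLLMs across variants (open-source vs. proprietary, thinking vs. non-thinking, small vs. large scale) all struggle on MyEgo. The top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but the gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. The authors hope that MyEgo and their analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo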

Related papers