Fugu-MT Paper Translation (Abstract): MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Paper overview: MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

arxiv url: http://arxiv.org/abs/2603.09827v2
Date: Wed, 11 Mar 2026 02:13:29 GMT
Status: Translation complete
Last updated in system: 2026-03-12 14:12:44.447122
Title: MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
Title (reference translation): MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents
Authors: Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang
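Abstract summary: As embodied models become more powerful, humans will come to collaborate with multiple embodied AI agents at work and at home. Existing challenges include effectively compressing and communicating individual sensory inputs in video form. We first formally define a novel problem: understanding multiple long-horizon egocentric videos collected simultaneously from multiple embodied agents.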
Reference score (independently computed attention score): 54.48066948369172
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.
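Abstract (reference translation): As embodied models become more powerful, humans will come to collaborate with multiple embodied AI agents at work and at home. To improve communication between human users and the multi-agent system, it is important to interpret incoming information from the agents in parallel and to refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating large volumes of individual sensory input in video form, and correctly aggregating multiple egocentric videos to build system-level memory. In this work, we first formally define a novel problem: understanding multiple long-horizon egocentric videos collected simultaneously from multiple embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark for systemically evaluating existing models. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA called EgoMAS. Through a comprehensive evaluation of diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches cannot effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across agents. The code and benchmark are available at https://ma-egoqa.github.io.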

Related papers