Fugu-MT Paper Translation (Abstract): BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Paper overview: BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

arxiv url: http://arxiv.org/abs/2510.08759v1
Date: Thu, 09 Oct 2025 19:18:36 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:47.601283
Title: BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Title (reference translation): BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities
Authors: Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong,
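Abstract summary: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. We introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains, covering tasks that range from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.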
Reference score (independently computed attention score): 61.173773299032746
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/
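Abstract (reference translation): Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. Multimodal large language models (MLLMs) show promise as embodied agents, but because existing benchmarks focus mainly on specific domains such as planning or spatial understanding, a thorough and systematic evaluation of their embodied capabilities remains underexplored. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains, covering tasks that range from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. Extensive evaluation of 20 representative MLLMs reveals persistent limitations across all domains of embodied capabilities. We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially improves MLLM performance on BEAR, yielding a 9.12% absolute gain and a 17.5% relative improvement on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/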

Related papers