Fugu-MT Paper Translation (Abstract): Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments

Paper summary: Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments

arxiv url: http://arxiv.org/abs/2509.17425v1
Date: Mon, 22 Sep 2025 07:17:26 GMT
Status: Translation complete
Last updated in system: 2025-09-23 18:58:16.25983
Title: Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments
Title (reference translation): Evaluating Multimodal Large Language Models Using Daily Composite Tasks in Home Environments
Authors: Zhenliang Zhang, Yuxi Wang, Hongzhao Xie, Shiyun Zhao, Mingyuan Liu, Yujie Lu, Xinyi He, Zhenku Cheng, Yujia Peng
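Abstract summary: The authors design a set of composite tasks inspired by daily activities in early childhood. Within a dynamic, simulated home environment, these tasks span three core domains (object understanding, spatial intelligence, and social activity). They evaluate 17 proprietary and open-source MLLMs on these tasks; the results show consistently poor performance across all three domains, suggesting a substantial gap between current capabilities and the requirements of general intelligence.
Reference score (independently calculated attention score): 22.833316179836704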
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object understanding, spatial intelligence, and social activity. We evaluated 17 leading proprietary and open-source MLLMs on these tasks. The results consistently showed poor performance across all three domains, indicating a substantial gap between current capabilities and general intelligence requirements. Together, our tasks offer a preliminary framework for evaluating the general capabilities of embodied agents, marking an early but significant step toward the development of embodied MLLMs and their real-world deployment.
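Abstract (reference translation): An important feature that distinguishes artificial general intelligence (AGI) from traditional AI is that AGI can carry out composite tasks requiring a broad range of capabilities. Embodied agents built on multimodal large language models (MLLMs) provide perceptual and interactive capabilities, but whether they can solve composite tasks has not yet been established. This study designs a set of composite tasks inspired by everyday activities in early childhood. Within a dynamic, simulated home environment, these tasks span three core domains (object understanding, spatial intelligence, and social activity). Seventeen proprietary and open-source MLLMs were evaluated on these tasks. The results show consistently poor performance in all three domains, suggesting a large gap between current capabilities and the requirements of general intelligence. The tasks provide a preliminary framework for evaluating the general capabilities of embodied agents and represent an early but important step toward the development of embodied MLLMs and their real-world deployment.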
