Fugu-MT Paper Translation (Abstract): Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

Paper overview: Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2509.26165v1
Date: Tue, 30 Sep 2025 12:20:57 GMT
Status: Translation complete
Updated in system: 2025-10-01 14:45:00.128756
Title: Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Title (reference translation): Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Authors: Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, Qingdong He, Donghao Luo, Zhenye Gan, Junwei Zhu, Yunhang Shen, Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan
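Abstract summary: Multimodal Large Language Models (MLLMs) have shown significant progress in visual understanding tasks. Human-MME is a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. The benchmark extends single-target understanding to multi-person and multi-image mutual understanding.
Reference score (independently computed attention score): 119.52829803686319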
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating human-based activities progressively from human-oriented granular perception to higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing an automated annotation pipeline and a human-annotation platform that support rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding by constructing choice, short-answer, grounding, ranking, and judgment question components, as well as complex questions that combine them. Extensive experiments on 17 state-of-the-art MLLMs effectively expose their limitations and guide future MLLM research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
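Abstract (reference translation): Multimodal Large Language Models (MLLMs) have shown significant progress in visual understanding tasks. However, their ability to understand human-centric scenes has hardly been studied, because comprehensive evaluation benchmarks that consider both the human-oriented granular level and higher-dimensional causal reasoning ability are lacking. Such high-quality evaluation benchmarks face severe obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a benchmark for providing a more comprehensive evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work has three key features. 1. Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields. 2. Progressive and diverse evaluation dimensions, which evaluate progressively from human-oriented granular perception to higher-dimensional reasoning and consist of eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, building an automated annotation pipeline and a human-annotation platform that support rigorous manual labeling and facilitate precise and reliable model assessment. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding by constructing choice, short-answer, grounding, ranking, and judgment questions, as well as complex questions that combine them. Extensive experiments on 17 state-of-the-art MLLMs effectively expose their limitations and guide future MLLM research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.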


Related papers