Fugu-MT Paper Translation (Abstract): Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

Paper overview: Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

arxiv url: http://arxiv.org/abs/2605.11693v1
Date: Tue, 12 May 2026 07:50:55 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.67789
Title: Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
Title (reference translation): Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
Authors: Abid Ali, Diego Molla-Aliod, Usman Naseem
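Abstract summary: This paper introduces MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. In this setting, factual consistency acts as a critical determinant of perceived overall quality.
Reference score (attention score computed independently by the site): 12.669087812857533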
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. Our analysis reveals a text-dominant hierarchy in this setting, where factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.
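Abstract (reference translation): Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), in which systems generate concise textual summaries accompanied by salient visuals from multimodal sources. Text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. In this setting, factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.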
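
Illustrative sketch (not from the paper): the abstract says that MM-Eval combines its component scores through a learned aggregation model aligned with human preferences, but it does not state what form that model takes. The Python sketch below assumes, purely for illustration, a linear least-squares model over four component scores. The names `COMPONENTS`, `fit_aggregation`, and `aggregate`, the optional `ridge` parameter, and the toy numbers are all hypothetical and are not taken from the paper or from any benchmark.

```python
from typing import List, Sequence

# Hypothetical component order; the abstract names these signals but not their encoding.
COMPONENTS = ("factual_consistency", "g_eval", "image_text_relevance", "image_diversity")


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: component scores are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_aggregation(scores: Sequence[Sequence[float]],
                    human: Sequence[float],
                    ridge: float = 0.0) -> List[float]:
    """Fit least-squares weights plus an intercept mapping component scores to human ratings."""
    rows = [list(s) + [1.0] for s in scores]  # trailing 1.0 is the intercept column
    d = len(rows[0])
    # Normal equations: (X^T X + ridge * I) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    for i in range(d - 1):  # the intercept is left unpenalised
        xtx[i][i] += ridge
    xty = [sum(r[i] * y for r, y in zip(rows, human)) for i in range(d)]
    return solve_linear_system(xtx, xty)


def aggregate(weights: Sequence[float], component_scores: Sequence[float]) -> float:
    """Combine one summary's component scores into a single overall score."""
    *w, bias = weights
    return sum(wi * si for wi, si in zip(w, component_scores)) + bias


# Toy, made-up data: one row of component scores per summary, in COMPONENTS order.
toy_scores = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
toy_human = [0.1, 0.6, 0.3, 0.3, 0.2, 1.1]  # made-up human ratings

weights = fit_aggregation(toy_scores, toy_human)
overall = aggregate(weights, [0.9, 0.8, 0.7, 0.5])
```

In a linear model of this kind, the fitted weights can be read as the relative contribution of each component, which is one simple way an aggregation could be made interpretable. Whether MM-Eval uses a linear model, a regularised one, or something else is not stated in the abstract.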

Related papers