Fugu-MT paper translation (abstract): ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

Paper overview: ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs

arxiv url: http://arxiv.org/abs/2604.03765v2
Date: Mon, 13 Apr 2026 12:40:33 GMT
Status: translation complete
Last updated in system: 2026-04-14 14:47:45.496278
Title: ITIScore: An Image-to-Text-to-Image Rating Framework for the Image Captioning Ability of MLLMs
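Title (reference translation): ITIScore: An image-to-text-to-image rating framework for the image captioning ability of MLLMs
Authors: Zitong Xu, Huiyu Duan, Shengyao Qin, Guangyu Yang, Guangji Ma, Xiongkuo Min, Ke Gu, Guangtao Zhai, Patrick Le Callet
Abstract summary: ICBench is a large-scale image captioning benchmark that covers 12 content categories and consists of short and long captions for 2K images. We conduct extensive subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions. We propose ITIScore, an automated evaluation metric based on an image-to-text-to-image framework.
Reference score (independently computed attention score): 84.09282931360089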
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks typically suffer from limited diversity in caption length, the absence of recent advanced MLLMs, and insufficient human annotations, which potentially introduce bias and limit the ability to comprehensively assess the performance of modern MLLMs. To address these limitations, we present a new large-scale image captioning benchmark, termed ICBench, which covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, resulting in 40K captions in total. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, where short captions are assessed in terms of fluency, relevance, and conciseness, while long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose an automated evaluation metric, ITIScore, based on an image-to-text-to-image framework, which measures caption quality through reconstruction consistency. Experimental results demonstrate strong alignment between our automatic metric and human judgments, as well as robust zero-shot generalization ability on other public captioning datasets. Both the dataset and model will be released upon publication.
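The abstract says ITIScore measures caption quality through reconstruction consistency but gives no implementation details here. The following is only a minimal sketch of the general image-to-text-to-image idea, not the authors' method: a caption is turned back into an image, and the reconstruction is compared with the original. The names `generate_image`, `embed_image`, `toy_generate_image`, and `toy_embed_image` are hypothetical stand-ins, and cosine similarity is an assumed choice of comparison.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def reconstruction_consistency(
    image: object,
    caption: str,
    generate_image: Callable[[str], object],   # hypothetical text-to-image model
    embed_image: Callable[[object], Vector],   # hypothetical image encoder
) -> float:
    """Score a caption by how closely an image rebuilt from it matches the original."""
    reconstructed = generate_image(caption)
    return cosine_similarity(embed_image(image), embed_image(reconstructed))


# Toy stand-ins (assumptions for illustration only): an "image" is a vector
# marking which of three objects it contains.
OBJECTS = ("dog", "cat", "grass")


def toy_generate_image(caption: str) -> Vector:
    return [1.0 if name in caption else 0.0 for name in OBJECTS]


def toy_embed_image(image: object) -> Vector:
    return list(image)  # the toy image already is its own embedding


original = [1.0, 0.0, 1.0]  # contains a dog and grass
faithful = reconstruction_consistency(
    original, "a dog on grass", toy_generate_image, toy_embed_image
)
unfaithful = reconstruction_consistency(
    original, "a cat", toy_generate_image, toy_embed_image
)
print(faithful > unfaithful)
```

In this toy setup a caption that names the objects in the image reconstructs it closely and scores higher than one that does not. A real pipeline would replace the stand-ins with an actual text-to-image generator and image encoder.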
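Abstract (reference translation): Recent advances in multimodal large language models (MLLMs) have greatly improved image understanding and captioning capabilities. However, existing image captioning benchmarks suffer from limited caption length, the absence of recent advanced MLLMs, and insufficient human annotation, which can introduce bias and limit the ability to comprehensively evaluate the performance of modern MLLMs. To address these limitations, we propose a new large-scale image captioning benchmark called ICBench. ICBench covers 12 content categories and consists of both short and long captions generated by 10 advanced MLLMs on 2K images, for a total of 40K captions. We conduct extensive human subjective studies to obtain mean opinion scores (MOSs) across fine-grained evaluation dimensions, in which short captions are evaluated in terms of fluency, relevance, and conciseness, and long captions are evaluated based on fluency, relevance, and completeness. Furthermore, we propose ITIScore, an automated evaluation metric based on an image-to-text-to-image framework. Experimental results show strong agreement between our automatic metric and human judgments, as well as good zero-shot generalization ability on other public captioning datasets. Both the dataset and the model will be released.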
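The abstract reports strong alignment between the automatic metric and human judgments without stating, in this excerpt, which agreement statistic is used. As a hedged illustration, Spearman rank correlation between metric scores and MOSs is one common way to quantify such agreement. It is assumed here, not taken from the paper, and the numbers below are made up purely for demonstration.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example values (not from the paper): human MOSs and metric scores
# for four captions.
mos = [4.5, 3.0, 1.5, 2.0]
metric_scores = [0.9, 0.6, 0.1, 0.3]
agreement = spearman(mos, metric_scores)
print(agreement)
```

Because the statistic depends only on rank order, it rewards a metric for ordering captions the same way human raters do, regardless of the scale of its raw scores.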


Related papers