Fugu-MT Paper Translation (Abstract): Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Paper abstract: Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

arxiv url: http://arxiv.org/abs/2601.04946v2
Date: Sat, 10 Jan 2026 09:28:13 GMT
Status: Translation complete
Last updated in system: 2026-01-13 15:02:56.561711
Title: Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Authors: Subhadeep Roy, Gagan Bhatia, Steffen Eger
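Abstract summary: We study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce ProtoBias, a contrastive benchmark spanning Animals, Objects, and Demography images. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank the benchmark's contrastive pairs. We propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking.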
Reference score (independently computed attention score): 25.374192139098284
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study prototypicality bias as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark, ProtoBias (Prototypical Bias), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose ProtoScore, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking while running orders of magnitude faster than GPT-5 inference and approaching the robustness of much larger closed-source judges.
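The directional evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the names `ContrastivePair`, `misranking_rate`, and `mean_margin` are hypothetical, the metric is modelled as any function that maps a prompt and an image identifier to a score, and counting ties as failures is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ContrastivePair:
    """One benchmark item (hypothetical structure, not from the paper)."""
    prompt: str
    correct_image: str      # semantically correct but non-prototypical
    adversarial_image: str  # subtly incorrect yet prototypical


# A metric is assumed to be any callable (prompt, image) -> score,
# where a higher score means better text-image alignment.
Metric = Callable[[str, str], float]


def misranking_rate(pairs: Sequence[ContrastivePair], metric: Metric) -> float:
    """Fraction of pairs where the adversarial image scores at least as
    high as the correct one (ties counted as failures by assumption)."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    failures = sum(
        1
        for p in pairs
        if metric(p.prompt, p.adversarial_image) >= metric(p.prompt, p.correct_image)
    )
    return failures / len(pairs)


def mean_margin(pairs: Sequence[ContrastivePair], metric: Metric) -> float:
    """Average of score(correct) - score(adversarial); a positive value
    means the metric tends to follow the text rather than the prototype."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    total = sum(
        metric(p.prompt, p.correct_image) - metric(p.prompt, p.adversarial_image)
        for p in pairs
    )
    return total / len(pairs)
```

Under these assumptions, a metric that always follows the prompt has a misranking rate of 0, while a metric that always defaults to the prototype has a rate of 1. The margin gives a rough analogue of the decision margins the abstract reports for human evaluators.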
