Fugu-MT Paper Translation (Abstract): DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

Paper overview: DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

arxiv url: http://arxiv.org/abs/2512.14420v1
Date: Tue, 16 Dec 2025 14:06:35 GMT
Status: Translation complete
System update date: 2025-12-17 16:49:26.73793
Title: DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
Title (reference translation): DISCODE: A Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
Authors: Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa
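Abstract summary: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. This work introduces the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores. The authors show that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric.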
Reference score (independently computed attention score): 22.541665746109285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.
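Abstract (reference translation): Large vision-language models (LVLMs) have shown remarkable performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains difficult, particularly under domain-shift scenarios. To address this problem, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores aligned with human judgments across diverse domains. The core idea behind DISCODE is its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss and leverages a Gaussian prior distribution to improve the robustness of evaluation score estimation. This loss is minimized efficiently at test time using an analytical solution that we derive. In addition, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric on MCEval and four existing benchmarks.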

Related papers