Related papers: DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning

URL: http://arxiv.org/abs/2512.14420v1
Date: Tue, 16 Dec 2025 14:06:35 GMT
Title: DISCODE: Distribution-Aware Score Decoder for Robust Automatic Evaluation of Image Captioning
Authors: Nakamasa Inoue, Kanoko Goto, Masanari Oi, Martyna Gruszka, Mahiro Ukai, Takumi Hirose, Yusuke Sekikawa,
Abstract summary: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks.<n>We introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores.<n>In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric.
Score: 22.541665746109285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large vision-language models (LVLMs) have shown impressive performance across a broad range of multimodal tasks. However, robust image caption evaluation using LVLMs remains challenging, particularly under domain-shift scenarios. To address this issue, we introduce the Distribution-Aware Score Decoder (DISCODE), a novel finetuning-free method that generates robust evaluation scores better aligned with human judgments across diverse domains. The core idea behind DISCODE lies in its test-time adaptive evaluation approach, which introduces the Adaptive Test-Time (ATT) loss, leveraging a Gaussian prior distribution to improve robustness in evaluation score estimation. This loss is efficiently minimized at test time using an analytical solution that we derive. Furthermore, we introduce the Multi-domain Caption Evaluation (MCEval) benchmark, a new image captioning evaluation benchmark covering six distinct domains, designed to assess the robustness of evaluation metrics. In our experiments, we demonstrate that DISCODE achieves state-of-the-art performance as a reference-free evaluation metric across MCEval and four representative existing benchmarks.

Related papers

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following.<n>For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses.<n>Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge [51.93484138861584]
The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods.<n>We propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching.<n>Experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs.
arXiv Detail & Related papers (2026-02-10T05:07:46Z)
SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization [64.95852289011385]
Large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved.<n> evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs.<n>We propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection.
arXiv Detail & Related papers (2026-02-08T11:12:45Z)
Fluid Language Model Benchmarking [126.92394365620525]
We introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions.<n>Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level.<n>We examine four dimensions -- efficiency, validity, variance, and saturation -- and find that Fluid Benchmarking achieves superior performance in all of them.
arXiv Detail & Related papers (2025-09-14T05:49:42Z)
Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs [29.764833226591012]
This paper introduces a certifiable and cost-efficient evaluation framework for large language models (LLMs)<n>We use test sample complexity'' to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity.<n>Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation.
arXiv Detail & Related papers (2025-05-02T17:05:01Z)
Where is this coming from? Making groundedness count in the evaluation of Document VQA models [12.951716701565019]
We argue that common evaluation metrics do not account for the semantic and multimodal groundedness of a model's outputs.<n>We propose a new evaluation methodology that accounts for the groundedness of predictions.<n>Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences.
arXiv Detail & Related papers (2025-03-24T20:14:46Z)
SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.<n>Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.<n>We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications.<n> Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.<n>Our benchmark is characterized by its multi-dimensional evaluation framework.<n>Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z)
Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose an evaluation algorithm, that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z)
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.