EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
- URL: http://arxiv.org/abs/2506.24016v1
- Date: Mon, 30 Jun 2025 16:20:51 GMT
- Title: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
- Authors: Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
- Abstract summary: EXPERT provides structured explanations based on fluency, relevance, and descriptiveness. We develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation.
- Score: 13.07921264603001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
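The abstract does not spell out the evaluation template itself, so the following is a minimal Python sketch of what a two-stage, reference-free template for a vision-language model could look like. The `query_vlm` helper and the prompt wording are assumptions for illustration, not the authors' released code (see the linked repository for the actual implementation).

```python
# Minimal sketch of a two-stage, reference-free caption evaluation template in the
# spirit of EXPERT. The `query_vlm` call and the prompt wording are assumptions.

CRITERIA = ["fluency", "relevance", "descriptiveness"]

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for any vision-language model call (local model or API)."""
    raise NotImplementedError("plug in your own VLM client here")

def evaluate_caption(image_path: str, caption: str) -> dict:
    # Stage 1: ask the VLM for a numeric score per criterion.
    scoring_prompt = (
        "You are evaluating an image caption without any reference captions.\n"
        f'Caption: "{caption}"\n'
        "Rate the caption from 1 to 5 on each criterion: "
        + ", ".join(CRITERIA)
        + ". Answer as `criterion: score` lines."
    )
    raw_scores = query_vlm(image_path, scoring_prompt)

    # Stage 2: ask for a structured explanation conditioned on the scores.
    explanation_prompt = (
        f'Caption: "{caption}"\n'
        f"Scores:\n{raw_scores}\n"
        "For each criterion, explain in one or two sentences why the caption "
        "received that score, grounded in what is visible in the image."
    )
    explanation = query_vlm(image_path, explanation_prompt)
    return {"scores": raw_scores, "explanation": explanation}
```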
Related papers
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
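As a rough illustration of the unit-level idea behind DCScore, the sketch below decomposes a caption into small units and scores the fraction that a judge confirms against the image. The clause splitter and the `verify_unit` judge are simplified assumptions, not the paper's actual procedure.

```python
# Illustrative sketch of unit-level caption scoring in the spirit of DCScore.
# The decomposition, judge, and aggregation are simplified assumptions.
import re

def decompose(caption: str) -> list[str]:
    # Naive stand-in for "smallest self-sufficient units": split on sentence
    # boundaries and coordinating conjunctions.
    parts = re.split(r"[.;]\s*|\band\b", caption)
    return [p.strip() for p in parts if p.strip()]

def verify_unit(unit: str, image_path: str) -> bool:
    """Placeholder judge: True if the unit is supported by the image."""
    raise NotImplementedError("plug in a VLM or human judgment here")

def dcscore_like(caption: str, image_path: str) -> float:
    units = decompose(caption)
    supported = sum(verify_unit(u, image_path) for u in units)
    # Precision-style score: fraction of units that are not hallucinated.
    return supported / max(len(units), 1)
```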
arXiv Detail & Related papers (2025-03-10T22:53:56Z) - Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data. We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods. The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z) - Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy [3.0429215246859465]
Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline. We refer to this self-evaluation approach as Explain-Query-Test (EQT).
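A hedged sketch of an Explain-Query-Test style loop is given below; the `ask_llm` helper, the question-generation prompt, and the string-matching grader are assumptions standing in for the paper's protocol.

```python
# Rough sketch of an Explain-Query-Test (EQT) style self-evaluation loop.
# The LLM interface and grading are assumptions, not the paper's exact protocol.

def ask_llm(prompt: str) -> str:
    """Placeholder for any LLM call."""
    raise NotImplementedError

def eqt(topic: str, num_questions: int = 5) -> float:
    # Explain: have the model produce a detailed explanation.
    explanation = ask_llm(f"Explain in detail: {topic}")

    # Query: derive questions (with gold answers) from that explanation.
    qa_text = ask_llm(
        f"From the explanation below, write {num_questions} factual questions "
        f"with short gold answers, one 'Q: ... | A: ...' pair per line.\n\n{explanation}"
    )
    pairs = [line.split("|", 1) for line in qa_text.splitlines() if "|" in line]

    # Test: ask the same model the questions without showing the explanation,
    # and measure how often its answers match the gold answers.
    correct = 0
    for q, a in pairs:
        answer = ask_llm(q.replace("Q:", "").strip())
        correct += a.replace("A:", "").strip().lower() in answer.lower()
    return correct / max(len(pairs), 1)
```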
arXiv Detail & Related papers (2025-01-20T20:07:18Z) - HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE). HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z) - Benchmarking and Improving Detail Image Caption [12.078715675876674]
Image captioning has long been regarded as a fundamental task in visual understanding for large vision-language models (LVLMs).
We propose to benchmark detail image caption task by curating high-quality evaluation datasets annotated by human experts.
We also design a more reliable caption evaluation metric called CAPTURE.
arXiv Detail & Related papers (2024-05-29T13:54:12Z) - Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
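As a toy illustration of using a label hierarchy to generate follow-up questions, the snippet below walks a hand-written hypernym chain from coarse to fine. The tiny hierarchy and the question template are assumptions; the paper derives them from existing classification datasets and their semantic hierarchies.

```python
# Toy sketch of generating follow-up VQA questions from a label hierarchy.
# The hierarchy and template are illustrative assumptions.

HIERARCHY = {
    "golden retriever": ["dog", "animal"],
    "sports car": ["car", "vehicle"],
}

def follow_up_questions(ground_truth: str) -> list[str]:
    # Ask from coarse to fine: hypernyms first, then the exact category.
    chain = list(reversed(HIERARCHY.get(ground_truth, []))) + [ground_truth]
    return [f"Is the main object in the image a kind of '{label}'?" for label in chain]

# Coarse-to-fine questions: animal -> dog -> golden retriever.
print(follow_up_questions("golden retriever"))
```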
arXiv Detail & Related papers (2024-02-11T18:26:18Z) - Saliency-Bench: A Comprehensive Benchmark for Evaluating Visual Explanations [14.09795099827903]
Saliency-Bench is a novel benchmark suite designed to evaluate visual explanations generated by saliency methods across multiple datasets. The benchmark includes a standardized and unified evaluation pipeline for assessing faithfulness and alignment of the visual explanation.
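One common faithfulness probe that such a pipeline can standardize is a deletion-style test: occlude the most salient pixels and measure how much the model's confidence drops. The sketch below shows this idea as an assumed protocol for illustration, not necessarily the exact metric used by Saliency-Bench.

```python
# Hedged sketch of a deletion-style faithfulness check for a saliency map.
import numpy as np

def deletion_drop(model, image: np.ndarray, saliency: np.ndarray,
                  target_class: int, fraction: float = 0.2) -> float:
    """`model(batch)` is assumed to return class probabilities of shape (1, C)."""
    base = model(image[None])[0, target_class]

    # Zero out the top `fraction` most salient pixels.
    k = max(1, int(fraction * saliency.size))
    threshold = np.partition(saliency.ravel(), -k)[-k]
    occluded = image.copy()
    occluded[saliency >= threshold] = 0.0

    drop = base - model(occluded[None])[0, target_class]
    return float(drop)  # larger drop = saliency pointed at pixels the model relied on
```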
arXiv Detail & Related papers (2023-10-12T17:26:16Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
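A condensed sketch of this decompose-then-recompose flow is shown below; the `answer_yes_probability` helper is an assumption standing in for the PLM-based answering step.

```python
# Condensed sketch of the DecompEval idea: per-sentence subquestions answered by a
# PLM, then recomposed into one score. The answering helper is an assumption.
import re

def answer_yes_probability(question: str) -> float:
    """Placeholder: probability the PLM answers 'yes' to the subquestion."""
    raise NotImplementedError

def decomp_eval(context: str, generated: str, aspect: str = "coherent") -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", generated.strip()) if s]
    sub_answers = []
    for sent in sentences:
        subquestion = (
            f"Context: {context}\nSentence: {sent}\n"
            f"Is this sentence {aspect} with respect to the context? Answer yes or no."
        )
        sub_answers.append(answer_yes_probability(subquestion))
    # Recompose: average the per-sentence evidence into one score.
    return sum(sub_answers) / max(len(sub_answers), 1)
```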
arXiv Detail & Related papers (2023-07-13T16:16:51Z) - Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
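A hedged sketch of criticism in latent space: embed real and generated documents, summarize each set of latents with a Gaussian, and compare the fits. The embedding placeholder and the Fréchet-style statistic below are illustrative choices, not necessarily the paper's exact test.

```python
# Hedged sketch of "model criticism in latent space" for long-form text.
import numpy as np

def embed(documents: list[str]) -> np.ndarray:
    """Placeholder: map each document to a latent vector (e.g., a sentence encoder)."""
    raise NotImplementedError

def latent_gap(real_docs: list[str], generated_docs: list[str]) -> float:
    real, fake = embed(real_docs), embed(generated_docs)
    mu_r, mu_f = real.mean(axis=0), fake.mean(axis=0)
    var_r, var_f = real.var(axis=0), fake.var(axis=0)
    # Diagonal-covariance Fréchet-style distance between the two latent Gaussians.
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.sum(var_r + var_f - 2.0 * np.sqrt(var_r * var_f)))
```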
arXiv Detail & Related papers (2022-10-16T04:35:58Z) - Evaluation Metrics for Conditional Image Generation [100.69766435176557]
We present two new metrics for evaluating generative models in the class-conditional image generation setting.
A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts.
We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models.
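One natural way to link a conditional metric to its unconditional counterpart is to compute the unconditional statistic within each class and average, as sketched below. The `fid` helper is a placeholder for any standard FID implementation, and the paper's actual metrics and weighting may differ.

```python
# Sketch of conditioning an unconditional metric (FID) on class labels by averaging
# within-class scores. The `fid` helper is a placeholder, not the paper's metric.
import numpy as np

def fid(real_features: np.ndarray, fake_features: np.ndarray) -> float:
    """Placeholder for an unconditional FID over feature arrays of shape (N, D)."""
    raise NotImplementedError

def per_class_fid(real_feats: np.ndarray, real_labels: np.ndarray,
                  fake_feats: np.ndarray, fake_labels: np.ndarray) -> float:
    classes = np.unique(real_labels)
    scores = [fid(real_feats[real_labels == c], fake_feats[fake_labels == c])
              for c in classes]
    return float(np.mean(scores))  # average within-class FID across all classes
```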
arXiv Detail & Related papers (2020-04-26T12:15:16Z)