VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
- URL: http://arxiv.org/abs/2509.25818v1
- Date: Tue, 30 Sep 2025 05:52:34 GMT
- Title: VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
- Authors: Kazuki Matsuda, Yuiga Wada, Shinnosuke Hirano, Seitaro Otsuki, Komei Sugiura
- Abstract summary: VELA is an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. LongCap-Arena is a benchmark specifically designed for evaluating metrics for long captions.
- Score: 3.8028282626618526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we focus on the automatic evaluation of long and detailed image captions generated by multimodal Large Language Models (MLLMs). Most existing automatic evaluation metrics for image captioning are primarily designed for short captions and are not suitable for evaluating long captions. Moreover, recent LLM-as-a-Judge approaches suffer from slow inference due to their reliance on autoregressive inference and early fusion of visual information. To address these limitations, we propose VELA, an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. Furthermore, we propose LongCap-Arena, a benchmark specifically designed for evaluating metrics for long captions. This benchmark comprises 7,805 images, the corresponding human-provided long reference captions and long candidate captions, and 32,246 human judgments from three distinct perspectives: Descriptiveness, Relevance, and Fluency. We demonstrated that VELA outperformed existing metrics and achieved superhuman performance on LongCap-Arena.
Related papers
- OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models [65.8015696586307]
We introduce OV-Fact, a novel method for measuring caption factuality of long captions. Our method improves agreement with human judgments and captures both captionness (recall) and factual precision in the same metric. Unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering.
arXiv Detail & Related papers (2025-07-25T13:38:06Z)
- Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives [37.02849705736749]
The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task. This survey provides a comprehensive overview of advancements in image captioning evaluation.
arXiv Detail & Related papers (2025-03-18T18:03:56Z)
- CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era [41.135849912850695]
We build a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance. We release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test.
arXiv Detail & Related papers (2025-03-16T02:56:09Z)
- Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z)
- LongCaptioning: Unlocking the Power of Long Video Caption Generation in Large Multimodal Models [52.05596926411973]
Large Multimodal Models (LMMs) have demonstrated exceptional performance in video captioning tasks. In this paper, we investigate the limitations of LMMs in generating long captions for long videos. We propose the LongCaption-Agent, a framework that synthesizes long caption data by hierarchical semantic aggregation.
arXiv Detail & Related papers (2025-02-21T11:40:23Z)
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model. We implement the token merging strategy, reducing the number of input visual tokens. AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose to use a simple length level embedding to endow image captioning models with the ability to control caption length. Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. We further devise a non-autoregressive image captioning approach whose complexity is independent of caption length.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.