Related papers: CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

URL: http://arxiv.org/abs/2503.12329v1
Date: Sun, 16 Mar 2025 02:56:09 GMT
Title: CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Authors: Kanzhi Cheng, Wenpo Song, Jiaxin Fan, Zheng Ma, Qiushi Sun, Fangzhi Xu, Chenyang Yan, Nuo Chen, Jianbing Zhang, Jiajun Chen
Abstract summary: We build a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance. We release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test.
Score: 41.135849912850695
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do current VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess detailed caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show decent caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 94.3% correlation with human rankings at just $4 per test. Data and resources will be open-sourced at https://caparena.github.io.
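
The abstract describes ranking models from pairwise caption battles and then measuring how well an automated ranking correlates with the human one. The following is a minimal sketch of that idea, not the paper's method: the abstract does not state which rating scheme or correlation coefficient is used, so plain win rates and Spearman's rank correlation are assumed here, and all model names and battle outcomes are hypothetical.

```python
from collections import defaultdict


def win_rates(battles):
    """Compute a win rate per model from pairwise battles.

    battles: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". A tie counts as half a win for each side.
    (Assumption: simple win rate stands in for whatever rating the
    benchmark actually uses.)
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1.0
        elif winner == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def spearman(rank_x, rank_y):
    """Spearman rank correlation between two orderings of the same items.

    Uses the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is
    valid when neither ranking contains ties.
    """
    n = len(rank_x)
    position_in_y = {model: i for i, model in enumerate(rank_y)}
    d_squared = sum((i - position_in_y[model]) ** 2 for i, model in enumerate(rank_x))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical human-voted battles between three placeholder models.
toy_battles = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_z", "a"),
    ("model_y", "model_z", "a"),
    ("model_y", "model_x", "tie"),
]
human_ranking = ranking(win_rates(toy_battles))

# Hypothetical ranking produced by an automated judge.
auto_ranking = ["model_x", "model_z", "model_y"]

print(human_ranking)
print(spearman(human_ranking, auto_ranking))
```

With real data, `toy_battles` would be replaced by the collected human votes and `auto_ranking` by the ordering an automated judge produces; a coefficient near 1 means the two rankings largely agree.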

Related papers

CaptionQA: Is Your Caption as Useful as the Image Itself? [39.852352842429376]
Image captions serve as efficient surrogates for visual content in systems such as retrieval, recommendation, and multi-step agentic inference pipelines. We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions. We release CaptionQA along with an open-source pipeline for extension to new domains.
arXiv Detail & Related papers (2025-11-26T03:43:32Z)
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions [3.8028282626618526]
VELA is an automatic evaluation metric for long captions developed within a novel LLM-Hybrid-as-a-Judge framework. LongCap-Arena is a benchmark specifically designed for evaluating metrics for long captions.
arXiv Detail & Related papers (2025-09-30T05:52:34Z)
Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z)
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness [30.44039177018447]
CAPability is a comprehensive benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions.
arXiv Detail & Related papers (2025-02-19T07:55:51Z)
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model. We implement the token merging strategy, reducing the number of input visual tokens. AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
Wolf: Captioning Everything with a World Summarization Framework [149.03339991072514]
Wolf is an automated captioning framework that adopts a mixture-of-experts approach. Our framework captures different levels of information and summarizes them efficiently. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-26T17:59:09Z)
Benchmarking and Improving Detail Image Caption [12.078715675876674]
Image captioning has long been regarded as a fundamental task in visual understanding. We propose to benchmark the detail image caption task by curating high-quality evaluation datasets annotated by human experts. We also design a more reliable caption evaluation metric called CAPTURE.
arXiv Detail & Related papers (2024-05-29T13:54:12Z)
Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text. Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
Transparent Human Evaluation for Image Captioning [70.03979566548823]
We develop a rubric-based human evaluation protocol for image captioning models. We show that human-generated captions are of substantially higher quality than machine-generated ones. We hope that this work will promote a more transparent evaluation protocol for image captioning.
arXiv Detail & Related papers (2021-11-17T07:09:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
