CaptionQA: Is Your Caption as Useful as the Image Itself?
- URL: http://arxiv.org/abs/2511.21025v1
- Date: Wed, 26 Nov 2025 03:43:32 GMT
- Title: CaptionQA: Is Your Caption as Useful as the Image Itself?
- Authors: Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu,
- Abstract summary: Image captions serve as efficient surrogates for visual content in systems such as retrieval, recommendation, and multi-step agentic inference pipelines. We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions. We release CaptionQA along with an open-source pipeline for extension to new domains.
- Score: 39.852352842429376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well the caption supports downstream tasks. CaptionQA is an extensible, domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with a fine-grained taxonomy (25 top-level and 69 subcategories) that identifies the information useful for domain-specific tasks. CaptionQA contains 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are usable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the utility of an image and that of its caption. Notably, models that are nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
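To make the caption-only evaluation protocol concrete, the sketch below shows one way such a loop could look in Python. It is an illustration under stated assumptions, not the released CaptionQA pipeline: the item fields, the `captionqa_items.json` file, and the `ask_llm` callable are placeholders.

```python
# Minimal sketch of a caption-only QA evaluation loop (illustrative only;
# field names and the ask_llm callable are assumptions, not the official
# CaptionQA pipeline from https://github.com/bronyayang/CaptionQA).
import json
from dataclasses import dataclass

@dataclass
class CaptionQAItem:
    image_id: str        # identifier of the underlying image
    caption: str         # model-generated caption under evaluation
    question: str        # multiple-choice question requiring visual information
    choices: list[str]   # answer options, e.g. ["A) ...", "B) ...", ...]
    answer: str          # gold choice label, e.g. "B"

def ask_llm(prompt: str) -> str:
    """Placeholder for a downstream LLM call (e.g. an OpenAI-compatible
    chat endpoint). It must return a single choice label such as 'B'."""
    raise NotImplementedError

def caption_utility(items: list[CaptionQAItem]) -> float:
    """Fraction of questions the LLM answers correctly from the caption alone.
    Comparing this number with image-based QA accuracy gives the
    image-vs-caption utility gap described in the abstract."""
    correct = 0
    for item in items:
        prompt = (
            "Answer the multiple-choice question using ONLY the caption.\n"
            f"Caption: {item.caption}\n"
            f"Question: {item.question}\n"
            "Choices:\n" + "\n".join(item.choices) +
            "\nReply with the letter of the correct choice."
        )
        prediction = ask_llm(prompt).strip().upper()[:1]
        correct += int(prediction == item.answer)
    return correct / len(items)

if __name__ == "__main__":
    # Hypothetical dump of benchmark items in the dataclass schema above.
    with open("captionqa_items.json") as f:
        items = [CaptionQAItem(**row) for row in json.load(f)]
    print(f"Caption utility: {caption_utility(items):.3f}")
```

The same loop run with the image in place of the caption would give the image-level upper bound against which caption utility is compared.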
Related papers
- SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering [15.985057987715974]
We propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA). SCRA-VQA employs a pre-trained visual language model to convert images into captions. It generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information.
arXiv Detail & Related papers (2025-09-25T08:01:28Z)
- ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
arXiv Detail & Related papers (2025-06-24T17:59:55Z)
- Multi-LLM Collaborative Caption Generation in Scientific Documents [30.856381292477177]
We introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP). Our approach unfolds in three key modules. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions.
arXiv Detail & Related papers (2025-01-05T14:09:12Z)
- Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
- Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z)
- PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs.
PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption.
We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions that contain semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- CapWAP: Captioning with a Purpose [56.99405135645775]
We propose a new task, Captioning with a Purpose (CapWAP)
Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population.
We show that it is possible to use reinforcement learning to directly optimize for the intended information need (a minimal sketch of this idea appears after this entry).
arXiv Detail & Related papers (2020-11-09T09:23:55Z)
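Following up on the CapWAP entry above, here is a minimal policy-gradient sketch of purpose-driven captioning: a captioner is rewarded when a reader model can answer the intended question from the generated caption. Everything here is an assumption for illustration; the `Captioner`, `reader_answers`, and their interfaces are placeholders, not the CapWAP implementation.

```python
# Illustrative REINFORCE-style update for purpose-driven captioning.
# The captioner and reader below are placeholders; CapWAP's actual models,
# reward design, and training details differ.
import torch

class Captioner(torch.nn.Module):
    """Placeholder caption policy. sample() should return a caption string and
    the summed log-probability of its tokens (a differentiable tensor)."""
    def sample(self, image) -> tuple[str, torch.Tensor]:
        raise NotImplementedError

def reader_answers(caption: str, question: str) -> str:
    """Placeholder QA reader that answers the question from the caption alone."""
    raise NotImplementedError

def reinforce_step(captioner: Captioner, optimizer: torch.optim.Optimizer,
                   image, question: str, gold_answer: str) -> float:
    caption, log_prob = captioner.sample(image)  # sample a caption from the policy
    # Reward is 1 only if the caption lets the reader satisfy the information need.
    reward = float(reader_answers(caption, question) == gold_answer)
    loss = -reward * log_prob                    # REINFORCE: raise probability of useful captions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

In spirit, this reward plays the same role as CaptionQA's caption-only QA accuracy, used here as a training signal rather than an evaluation metric.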
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.