Related papers: ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing

URL: http://arxiv.org/abs/2506.19848v1
Date: Tue, 24 Jun 2025 17:59:55 GMT
Title: ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
Authors: Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
Abstract summary: Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
Score: 128.8346376825612
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
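The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that "contrastive sentence rating" can be approximated by comparing a sentence's score with and without the image, and that "heuristic question answering" appends one answer sentence per question up to a budget. Every name here (`contrastive_sentence_filter`, `enrich_caption`, the scorer and question callables, `threshold`, `budget`) is hypothetical.

```python
from typing import Callable, List

# Hypothetical scorer type: maps a sentence to a log-probability-like score.
Scorer = Callable[[str], float]


def contrastive_sentence_filter(
    sentences: List[str],
    logp_with_image: Scorer,
    logp_text_only: Scorer,
    threshold: float = 0.0,
) -> List[str]:
    """Keep sentences whose score improves when the image is visible.

    Assumption: a sentence that is just as likely without the image is
    driven by language priors and is treated as a possible hallucination.
    """
    kept = []
    for sentence in sentences:
        gain = logp_with_image(sentence) - logp_text_only(sentence)
        if gain > threshold:
            kept.append(sentence)
    return kept


def enrich_caption(
    base_sentences: List[str],
    propose_questions: Callable[[List[str]], List[str]],
    answer_question: Callable[[str], str],
    logp_with_image: Scorer,
    logp_text_only: Scorer,
    budget: int,
    threshold: float = 0.0,
) -> List[str]:
    """Filter the base caption, then add up to `budget` filtered answers."""
    caption = contrastive_sentence_filter(
        base_sentences, logp_with_image, logp_text_only, threshold
    )
    # A larger budget means more questions, hence more candidate details.
    for question in propose_questions(caption)[:budget]:
        answer = answer_question(question)
        caption += contrastive_sentence_filter(
            [answer], logp_with_image, logp_text_only, threshold
        )
    return caption
```

In this sketch the inference budget is simply the number of questions asked, so raising `budget` can only add sentences that pass the same filter as the base caption.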

Related papers

CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning [23.289413412387223]
We introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries.
arXiv Detail & Related papers (2026-02-25T07:34:26Z)
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning [90.19455861166745]
We introduce Captioning Reinforcement Learning (CapRL), a training framework that redefines caption quality through its utility. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%.
arXiv Detail & Related papers (2025-09-26T17:59:55Z)
ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization [9.914251544971686]
ReCap is a novel pipeline for event-enriched image retrieval and captioning. It incorporates broader contextual information from relevant articles to generate narrative-rich captions. Our approach addresses the limitations of standard vision-language models.
arXiv Detail & Related papers (2025-09-01T08:48:33Z)
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model. We implement the token merging strategy, reducing the number of input visual tokens. AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training. We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts. Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model. We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
Cross-Domain Image Captioning with Discriminative Finetuning [20.585138136033905]
Fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language. We show that discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task.
arXiv Detail & Related papers (2023-04-04T09:33:16Z)
PromptCap: Prompt-Guided Task-Aware Image Captioning [118.39243917422492]
We propose PromptCap, a captioning model designed to serve as a better connector between images and black-box LMs. PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA.
arXiv Detail & Related papers (2022-11-15T19:07:53Z)
Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training. We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation. In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information. We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations. Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
Length-Controllable Image Captioning [67.2079793803317]
We propose a simple length-level embedding that gives existing captioning models the ability to control caption length. Due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. We further devise a non-autoregressive image captioning approach that generates captions with a complexity that does not depend on caption length.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.