CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
- URL: http://arxiv.org/abs/2509.22647v1
- Date: Fri, 26 Sep 2025 17:59:55 GMT
- Title: CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
- Authors: Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin,
- Abstract summary: We introduce Captioning Reinforcement Learning (CapRL), a training framework that redefines caption quality through its utility. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL brings significant gains across multiple settings. CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%.
- Score: 90.19455861166745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
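The decoupled two-stage reward described in the abstract can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `lvlm_generate_caption` and `llm_answer_mcq` and the multiple-choice format are hypothetical names introduced here for clarity.

```python
# Minimal sketch of a CapRL-style verifiable reward:
# stage 1, an LVLM writes a caption for the image;
# stage 2, a vision-free LLM answers multiple-choice questions
# using only that caption, and the reward is its answer accuracy.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MCQ:
    question: str
    options: List[str]  # e.g. ["A. red", "B. blue", "C. green", "D. yellow"]
    answer: str         # correct option letter, e.g. "B"


def caption_reward(
    image: object,
    mcqs: List[MCQ],
    lvlm_generate_caption: Callable[[object], str],   # hypothetical: the captioning policy
    llm_answer_mcq: Callable[[str, MCQ], str],        # hypothetical: vision-free answerer
) -> float:
    """Return the fraction of MCQs answered correctly from the caption alone."""
    caption = lvlm_generate_caption(image)            # stage 1: generate the caption
    correct = sum(
        llm_answer_mcq(caption, q).strip().upper().startswith(q.answer)
        for q in mcqs
    )                                                 # stage 2: grade caption-only answers
    return correct / max(len(mcqs), 1)                # objective reward in [0, 1]
```

In an RLVR training loop, an accuracy score of this kind could serve directly as the verifiable reward for the captioning policy, avoiding any subjective judgment of what a "good" caption looks like.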
Related papers
- CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning [23.289413412387223]
We introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries.
arXiv Detail & Related papers (2026-02-25T07:34:26Z) - ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
arXiv Detail & Related papers (2025-06-24T17:59:55Z) - RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models [43.76357924787902]
We propose a reinforcement learning-based post-training framework for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs.
arXiv Detail & Related papers (2025-06-23T07:55:52Z) - Multi-LLM Collaborative Caption Generation in Scientific Documents [30.856381292477177]
We introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP). Our approach unfolds in three key modules. Human evaluations demonstrate that informative captions produced by our approach rank higher than human-written captions.
arXiv Detail & Related papers (2025-01-05T14:09:12Z) - Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning [9.443456804893207]
Reinforcement Learning (RL) makes it possible to use the cross-modal retrieval similarity score between the generated caption and the input image as a reward to guide training.
Recent studies show that pre-trained cross-modal retrieval models can be used to provide this reward, completely eliminating the need for reference captions.
We propose a new image captioning training strategy that makes use of ground-truth (GT) captions in different ways.
arXiv Detail & Related papers (2024-02-21T17:05:06Z) - Reinforcement Learning from Diffusion Feedback: Q* for Image Search [2.5835347022640254]
We present two models for image generation using model-agnostic learning.
RLDF is a singular approach for visual imitation through prior-preserving reward function guidance.
It generates high-quality images over varied domains showcasing class-consistency and strong visual diversity.
arXiv Detail & Related papers (2023-11-27T09:20:12Z) - Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge amounts of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z) - Scaling Up Vision-Language Pre-training for Image Captioning [51.639880603821446]
We present LEMON, a LargE-scale iMage captiONer.
We show that LEMON achieves new state-of-the-art results on several major image captioning benchmarks.
arXiv Detail & Related papers (2021-11-24T02:30:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.