CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
- URL: http://arxiv.org/abs/2602.21655v1
- Date: Wed, 25 Feb 2026 07:34:26 GMT
- Title: CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
- Authors: Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou, Anxiang Zeng, Jianqiang Huang
- Abstract summary: We introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries.
- Score: 23.289413412387223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning remains a fundamental task for vision-language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits captioning models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate Complete and Correct Captions. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
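A minimal sketch of the dual-reward signal the abstract describes, assuming an LVLM-backed judge exposed through two hypothetical predicates (`answers` and `verify`); the class name, sampling budget, and penalty weight are illustrative assumptions, not the authors' released code:

```python
# Hedged sketch of a dual-reward for complete and correct captioning.
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class DualReward:
    # answers(caption, query): does the caption answer this image-derived query?
    answers: Callable[[str, str], bool]
    # verify(claim): does this sub-caption claim hold against the image?
    verify: Callable[[str], bool]
    num_sampled_queries: int = 8        # dynamic query sampling budget (assumed)
    hallucination_penalty: float = 1.0  # weight on the correctness penalty (assumed)

    def completeness(self, caption: str, visual_queries: Sequence[str]) -> float:
        """Reward the fraction of sampled image queries the caption answers."""
        k = min(self.num_sampled_queries, len(visual_queries))
        sampled = random.sample(list(visual_queries), k)
        return sum(self.answers(caption, q) for q in sampled) / max(k, 1)

    def correctness(self, claims: Sequence[str]) -> float:
        """Penalize sub-caption claims (from the caption decomposition)
        that fail verification against the image, i.e. likely hallucinations."""
        if not claims:
            return 0.0
        failed = sum(not self.verify(c) for c in claims)
        return -self.hallucination_penalty * failed / len(claims)

    def __call__(self, caption: str, visual_queries: Sequence[str],
                 claims: Sequence[str]) -> float:
        # Symmetric combination: maximize coverage, minimize hallucination.
        return self.completeness(caption, visual_queries) + self.correctness(claims)
```

In a policy-gradient loop this scalar would presumably score each sampled caption before the update; the real system would batch the LVLM judge calls rather than issue them one query at a time.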
Related papers
- CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning [90.19455861166745]
We introduce Captioning Reinforcement Learning (CapRL), a training framework that redefines caption quality through its utility. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL yields significant gains across multiple settings. CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%.
arXiv Detail & Related papers (2025-09-26T17:59:55Z)
- SC-Captioner: Improving Image Captioning with Self-Correction by Reinforcement Learning [21.739084696595427]
SC-Captioner is a reinforcement learning framework that equips image caption models with a self-correcting capability. We compute the set difference between the sets of initial and self-corrected captions to identify added and removed elements (a minimal illustration appears after this list). Experiments show that applying SC-Captioner to large vision-language models generates better image captions across various scenarios.
arXiv Detail & Related papers (2025-08-08T08:45:52Z)
- ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with an increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
arXiv Detail & Related papers (2025-06-24T17:59:55Z)
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key properties: informational sufficiency, minimal redundancy, and ready comprehensibility by humans. We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [8.526212812623202]
State-of-the-Art (SoTA) image captioning models are often trained on the Microsoft Common Objects in Context (MS COCO) dataset. We present a novel approach to generating richer and more informative image captions by combining the captions generated by different SoTA captioning models.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and assigns more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
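For the SC-Captioner entry above, the set-difference comparison it mentions can be sketched as follows; this is a hedged illustration in which `extract_claims` is a hypothetical stand-in for whatever caption decomposition the paper actually uses:

```python
# Hypothetical illustration of SC-Captioner's set-difference step.
from typing import Set, Tuple


def extract_claims(caption: str) -> Set[str]:
    # Stand-in decomposition: split on sentence boundaries. A real system
    # would extract atomic factual claims with an LLM or a parser.
    return {s.strip().lower() for s in caption.split(".") if s.strip()}


def correction_diff(initial: str, corrected: str) -> Tuple[Set[str], Set[str]]:
    """Return (added, removed): the elements the self-correction
    introduced and the elements it dropped, via set difference."""
    before, after = extract_claims(initial), extract_claims(corrected)
    return after - before, before - after
```

The added and removed sets would then feed a reward that credits corrections fixing errors and penalizes ones that discard correct content.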