RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
- URL: http://arxiv.org/abs/2505.22613v1
- Date: Wed, 28 May 2025 17:29:34 GMT
- Title: RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
- Authors: Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
- Abstract summary: RICO is a novel framework that refines captions through visual reconstruction. We introduce RICO-Flash, which learns to generate captions like RICO using DPO. Our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on CapsBench and CompreCap.
- Score: 22.72702783743817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and from incompleteness due to missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.
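The core of the method is the refine-by-reconstruction loop described in the abstract. A minimal sketch of that loop, with hypothetical `t2i_generate` and `mllm_query` callables standing in for the text-to-image model and the MLLM (the actual prompts and models are in the linked repository), might look like:

```python
# Hypothetical sketch of RICO's refine-by-reconstruction loop.
# `t2i_generate` and `mllm_query` stand in for a text-to-image model
# and a multimodal LLM; see https://github.com/wangyuchi369/RICO for
# the official implementation.

def rico_refine(original_image, caption, t2i_generate, mllm_query, num_iters=3):
    """Iteratively refine `caption` by comparing the original image
    against an image reconstructed from the current caption."""
    for _ in range(num_iters):
        # 1. Reconstruct a reference image from the current caption.
        reconstructed = t2i_generate(caption)
        # 2. Ask the MLLM to list discrepancies between the two images.
        discrepancies = mllm_query(
            images=[original_image, reconstructed],
            prompt="List details present in the first image that are "
                   "missing or wrong in the second image.",
        )
        # 3. Ask the MLLM to rewrite the caption to resolve them.
        caption = mllm_query(
            images=[original_image],
            prompt=f"Rewrite this caption to fix these issues.\n"
                   f"Caption: {caption}\nIssues: {discrepancies}",
        )
    return caption
```

RICO-Flash would then distill this behavior into a single model by training with DPO on caption pairs produced before and after refinement, avoiding the iterative cost at inference time.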
Related papers
- ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
arXiv Detail & Related papers (2025-06-24T17:59:55Z)
- Beam-Guided Knowledge Replay for Knowledge-Rich Image Captioning using Vision-Language Model [0.8747606955991707]
KRCapVLM is a novel knowledge-replay-based image captioning framework. We incorporate beam search decoding to generate more diverse and coherent captions. Our proposed model demonstrates clear improvements in both the accuracy of knowledge recognition and the overall quality of generated captions.
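Beam search itself is a standard decoding strategy; the sketch below uses an off-the-shelf BLIP captioner from Hugging Face transformers to illustrate how it returns multiple diverse candidates (KRCapVLM's backbone and knowledge-replay components are not shown here):

```python
# Beam search decoding for captioning, sketched with a generic BLIP
# model rather than KRCapVLM's own architecture.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Beam search keeps the `num_beams` highest-scoring partial captions at
# each decoding step; returning several beams yields diverse candidates.
out = model.generate(**inputs, num_beams=5, num_return_sequences=3, max_new_tokens=30)
for seq in out:
    print(processor.decode(seq, skip_special_tokens=True))
```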
arXiv Detail & Related papers (2025-05-29T11:33:36Z)
- Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models [33.76031793753807]
We adapt the autoregressive multimodal model Lumina-mGPT into a robust Real-ISR model, namely PURE. PURE Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Experimental results demonstrate that PURE preserves image content while generating realistic details.
arXiv Detail & Related papers (2025-03-14T04:33:59Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance caption quality.
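As a rough illustration of the kNN-memory idea, the sketch below retrieves the captions attached to the most visually similar images in an external datastore; feature extraction and the captioning model itself are omitted, and all names are illustrative:

```python
# Minimal sketch of kNN-memory retrieval: given a query image feature,
# return captions of its nearest neighbors in a precomputed datastore.
import numpy as np

def knn_retrieve(query_feat, memory_feats, memory_captions, k=5):
    """Return the captions of the k nearest memory images by cosine similarity."""
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q                    # cosine similarity to every memory entry
    top_k = np.argsort(-sims)[:k]   # indices of the k most similar images
    return [memory_captions[i] for i in top_k]
```

The retrieved captions would then condition the decoder, for example through a kNN-augmented attention layer as in the Retrieval-Augmented Transformer paper listed further below.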
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Mitigating Open-Vocabulary Caption Hallucinations [33.960405731583656]
We propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting.
Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations.
To mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa.
arXiv Detail & Related papers (2023-12-06T17:28:03Z)
- CoSeR: Bridging Image and Language for Cognitive Super-Resolution [74.24752388179992]
We introduce the Cognitive Super-Resolution (CoSeR) framework, empowering SR models with the capacity to comprehend low-resolution images.
We achieve this by marrying image appearance and language understanding to generate a cognitive embedding.
To further improve image fidelity, we propose a novel condition injection scheme called "All-in-Attention".
arXiv Detail & Related papers (2023-11-27T16:33:29Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate complementary cross-modal and temporal information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-the-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach that addresses these challenges by showing how captions generated by different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
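The fusion step described in this entry amounts to prompting an LLM with the original caption plus the vision experts' outputs. A purely hypothetical sketch of such a prompt builder (FuseCap's actual prompts and expert set may well differ):

```python
# Hypothetical sketch of the caption-fusion step: vision-expert outputs
# are packed into a single prompt and an LLM merges them with the
# original caption. Expert names and wording are illustrative only.
def build_fusion_prompt(original_caption, expert_outputs):
    """expert_outputs: dict mapping expert name -> detected details,
    e.g. {"objects": "a red bicycle, a dog", "ocr": "CAFE"}."""
    details = "\n".join(f"- {name}: {out}" for name, out in expert_outputs.items())
    return (
        "Rewrite the caption so it stays fluent but includes the details "
        "found by the vision experts.\n"
        f"Caption: {original_caption}\n"
        f"Detected details:\n{details}\n"
        "Enriched caption:"
    )
```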
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experiments conducted on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge numbers of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotations.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
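A minimal sketch of such a CLIP-similarity reward, using the Hugging Face transformers CLIP implementation (the RL fine-tuning loop around the reward is not shown):

```python
# Sketch of a CLIP-similarity reward: score a generated caption against
# its image with a pretrained CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between the image and text embeddings, usable
    # as a per-caption reward during fine-tuning.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()
```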
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
- Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts [5.859294565508523]
A new captioning model is developed, comprising an image encoder that extracts features, a mixture of recurrent networks that embeds the extracted features into a set of words, and a sentence generator that combines the obtained words into a stylized sentence.
We show that the proposed captioning model can generate diverse and stylized image captions without the need for extra labeling.
arXiv Detail & Related papers (2020-07-07T11:00:27Z)