Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
- URL: http://arxiv.org/abs/2512.04309v1
- Date: Wed, 03 Dec 2025 22:54:15 GMT
- Title: Text-Only Training for Image Captioning with Retrieval Augmentation and Modality Gap Correction
- Authors: Rui Fonseca, Bruno Martins, Gil Rocha
- Abstract summary: TOMCap is an improved text-only training method that performs captioning without the need for aligned image-caption pairs. We show that TOMCap outperforms other training-free and text-only methods.
- Score: 3.8722109816646864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning has drawn considerable attention from the natural language processing and computer vision fields. Aiming to reduce the reliance on curated data, several studies have explored image captioning without any human-annotated image-text pairs for training, although existing methods are still outperformed by fully supervised approaches. This paper proposes TOMCap, an improved text-only training method that performs captioning without the need for aligned image-caption pairs. The method is based on prompting a pre-trained language model decoder with information derived from a CLIP representation, after a process that reduces the modality gap. We specifically tested the combined use of retrieved caption examples and latent vector representations to guide the generation process. Through extensive experiments, we show that TOMCap outperforms other training-free and text-only methods. We also analyze the impact of different configuration choices for the retrieval-augmentation and modality gap reduction components.
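To make the pipeline concrete, here is a minimal sketch of the two ideas the abstract combines, under simplifying assumptions: a mean-shift correction of the CLIP modality gap (one possible choice; the paper studies several configurations) and nearest-neighbor caption retrieval to build a prompt for a language-model decoder. The random embeddings, the correction rule, and the prompt format are illustrative stand-ins, not TOMCap's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for CLIP embeddings of a caption datastore (text side) and a
# batch of query images (vision side); real features come from a CLIP encoder.
caption_texts = ["a dog runs on the beach", "two children play football",
                 "a red car parked on a street", "a plate of pasta with basil"]
text_embs = l2_normalize(rng.normal(size=(len(caption_texts), 512)))
image_embs = l2_normalize(rng.normal(size=(8, 512)))

# Modality gap reduction (illustrative mean-shift): move image embeddings by
# the difference between the text-side and image-side centroids so they land
# in the region of CLIP space occupied by caption embeddings.
gap = text_embs.mean(axis=0) - image_embs.mean(axis=0)
corrected = l2_normalize(image_embs + gap)

# Retrieval augmentation: the nearest captions to the corrected embedding
# become in-context examples for the language-model decoder.
scores = text_embs @ corrected[0]
retrieved = [caption_texts[i] for i in np.argsort(-scores)[:3]]
prompt = ("Similar captions:\n"
          + "\n".join(f"- {c}" for c in retrieved)
          + "\nCaption for this image:")
print(prompt)
```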
Related papers
- TextGuider: Training-Free Guidance for Text Rendering via Attention Alignment [68.91073792449201]
We propose TextGuider, a training-free method that encourages accurate and complete text appearance. Specifically, we analyze attention patterns in Multi-Modal Diffusion Transformer (MM-DiT) models, particularly for text-related tokens intended to be rendered in the image. Our method achieves state-of-the-art performance in test-time text rendering, with significant gains in recall and strong results in OCR accuracy and CLIP score.
arXiv Detail & Related papers (2025-12-10T06:18:30Z)
- IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning [3.8265756182141346]
We propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap.
Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features.
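A minimal sketch of the image-like retrieval idea, assuming Gaussian noise injection turns a text embedding into an image-like query; the noise model and the downstream Fusion Module are stand-ins, not necessarily IFCap's exact design.

```python
import torch

def image_like_query(text_emb: torch.Tensor, noise_std: float = 0.04) -> torch.Tensor:
    """Perturb a CLIP text embedding so it behaves like an image embedding.

    Noise injection is one common way to bridge the modality gap at
    retrieval time; the exact mechanism in IFCap may differ.
    """
    noisy = text_emb + noise_std * torch.randn_like(text_emb)
    return noisy / noisy.norm(dim=-1, keepdim=True)

# Toy datastore of caption embeddings and a single query embedding.
datastore = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)
query = torch.nn.functional.normalize(torch.randn(512), dim=-1)

scores = datastore @ image_like_query(query)
top_ids = scores.topk(k=4).indices  # captions that would feed the Fusion Module
```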
arXiv Detail & Related papers (2024-09-26T16:47:32Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
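A hedged sketch of decoder pre-training on pseudo visual embeddings: a tiny transformer decoder cross-attends to stand-in CLIP text-encoder outputs as if they were image features. The dimensions, the decoder architecture, and the random tensors are illustrative; only the pseudo-visual-embedding idea comes from the abstract.

```python
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Toy recognition decoder; causal masking omitted for brevity."""
    def __init__(self, d_model=512, vocab=100):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.embed = nn.Embedding(vocab, d_model)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens, memory):
        x = self.embed(tokens)
        return self.head(self.decoder(x, memory))

decoder = TinyDecoder()
# Pseudo visual embeddings: CLIP text-encoder outputs for the ground-truth
# strings (random stand-ins here), one vector per token position.
pseudo_visual = torch.randn(4, 16, 512)   # (batch, seq, dim)
tokens = torch.randint(0, 100, (4, 10))   # target character ids
logits = decoder(tokens, pseudo_visual)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tokens.reshape(-1))
```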
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation mechanism to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
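One plausible reading of subregion feature aggregation, sketched below: local patch or region features are pooled with weights given by their similarity to the global CLIP feature, then fused with it. The specific aggregation rule is an assumption, not the paper's exact formulation.

```python
import torch

def aggregate_subregions(global_feat: torch.Tensor, region_feats: torch.Tensor):
    # global_feat: (dim,), region_feats: (num_regions, dim), both L2-normalized.
    # Weight each region by its similarity to the global feature, pool, fuse.
    weights = torch.softmax(region_feats @ global_feat, dim=0)   # (num_regions,)
    local = (weights.unsqueeze(-1) * region_feats).sum(dim=0)    # weighted pool
    return torch.nn.functional.normalize(global_feat + local, dim=-1)

g = torch.nn.functional.normalize(torch.randn(512), dim=-1)
r = torch.nn.functional.normalize(torch.randn(49, 512), dim=-1)  # e.g. 7x7 patches
print(aggregate_subregions(g, r).shape)  # torch.Size([512])
```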
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
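A rough sketch of the feature-alignment step: a small mapper network nudges CLIP features of generated images toward real image features in embedding space. The mapper, the cosine objective, and the one-to-one pairing of synthetic and real features are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Mapper that corrects synthetic-image features (illustrative architecture).
mapper = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)

# Stand-ins for CLIP features of generated and real images; they are paired
# one-to-one here purely for illustration.
synthetic_feats = F.normalize(torch.randn(32, 512), dim=-1)
real_feats = F.normalize(torch.randn(32, 512), dim=-1)

opt.zero_grad()
aligned = F.normalize(mapper(synthetic_feats), dim=-1)
loss = 1.0 - F.cosine_similarity(aligned, real_feats).mean()
loss.backward()
opt.step()
```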
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" the real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
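The training signal can be pictured as mixing two scores, as in this hedged sketch: an adversarial score from a text discriminator (fluency and corpus style) and a CLIP similarity (visual faithfulness). The weighting scheme is an assumption; CgT-GAN's actual reward design may differ.

```python
import torch

def caption_reward(disc_score: torch.Tensor, clip_sim: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    # Mix "reads like corpus text" with "matches the image"; the linear
    # weighting is an illustrative assumption.
    return alpha * disc_score + (1 - alpha) * clip_sim

reward = caption_reward(torch.tensor(0.8), torch.tensor(0.31))
# In adversarial training, the generator is updated to maximize this reward
# (e.g., with a policy-gradient step), while the discriminator learns to
# separate corpus sentences from generated ones.
```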
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions.
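A compact sketch of bi-path supervision: the image is pulled toward both its raw web text and a synthetic caption via two InfoNCE terms. ALIP weights these adaptively per sample; fixed scalar weights are used here purely for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(img, txt, temperature=0.07):
    # Standard contrastive loss over a batch of matched image/text features.
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return F.cross_entropy(logits, labels)

img = F.normalize(torch.randn(16, 512), dim=-1)
raw_txt = F.normalize(torch.randn(16, 512), dim=-1)  # raw web text features
syn_txt = F.normalize(torch.randn(16, 512), dim=-1)  # synthetic caption features

w_raw, w_syn = 0.6, 0.4  # adaptive per sample in ALIP; fixed here
loss = w_raw * info_nce(img, raw_txt) + w_syn * info_nce(img, syn_txt)
```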
arXiv Detail & Related papers (2023-08-16T15:19:52Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
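The data transformation itself is simple to sketch: stack several images into a pseudo frame sequence and join their captions into a pseudo paragraph. Shapes and the joining rule below are illustrative.

```python
import torch

def concatenate_samples(images, captions):
    # images: list of (C, H, W) tensors; captions: list of strings.
    pseudo_video = torch.stack(images)       # (num_frames, C, H, W)
    pseudo_paragraph = " ".join(captions)    # sentence order mirrors frame order
    return pseudo_video, pseudo_paragraph

frames, paragraph = concatenate_samples(
    [torch.randn(3, 224, 224) for _ in range(4)],
    ["a dog on a beach.", "a red car.", "a bowl of soup.", "kids playing."],
)
print(frames.shape, "|", paragraph)
```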
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
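A small sketch of the fusion step: outputs of frozen vision experts are folded into an LLM prompt together with the original caption, and the model rewrites them into one enriched caption. The prompt wording and expert names are hypothetical.

```python
def build_fusion_prompt(original: str, expert_outputs: dict) -> str:
    # Assemble the original caption and expert findings into one instruction.
    details = "\n".join(f"- {name}: {out}" for name, out in expert_outputs.items())
    return (
        "Rewrite the caption so it includes the visual details.\n"
        f"Caption: {original}\n"
        f"Visual details:\n{details}\n"
        "Enriched caption:"
    )

prompt = build_fusion_prompt(
    "a man riding a bike",
    {"object detector": "man, bicycle, helmet, park bench",
     "OCR": "no legible text"},
)
# `prompt` would then be sent to a large language model for fusion.
```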
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
- Large-Scale Bidirectional Training for Zero-Shot Image Captioning [44.17587735943739]
We introduce Bidirectional Image Text Training in largER Scale, BITTERS, an efficient training and inference framework for zero-shot image captioning.
We show that careful selection of the large-scale training set and model architecture is key to achieving zero-shot image captioning.
arXiv Detail & Related papers (2022-11-13T00:09:36Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)