Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
- URL: http://arxiv.org/abs/2512.03463v1
- Date: Wed, 03 Dec 2025 05:36:46 GMT
- Title: Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
- Authors: Shojiro Yamabe, Futa Waseda, Daiki Shiono, Tsubasa Takahashi
- Abstract summary: We study text-centric training, a setting where only textual descriptions are available and no real images are provided. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. We propose a Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas.
- Score: 6.591422574318155
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose the Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into any existing LVLM training pipeline at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
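The core TPI operation is simple enough to sketch. Below is a minimal illustration of rendering a description onto a plain white canvas with Pillow; the canvas size, font, and line-wrapping width are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of the Text-Printed Image (TPI) idea: render a textual
# description onto a plain white canvas so text-only data enters the image
# modality. Canvas size, font, and wrapping are illustrative choices.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def text_printed_image(description: str, size=(448, 448), margin=16) -> Image.Image:
    canvas = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # swap in a TTF font for higher quality
    # Naive fixed-width wrapping; the paper may size and place text differently.
    y = margin
    for line in textwrap.wrap(description, width=48):
        draw.text((margin, y), line, fill="black", font=font)
        y += 14  # line height for the default bitmap font
    return canvas

tpi = text_printed_image("A red double-decker bus parked outside a museum.")
tpi.save("tpi_example.png")
```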
Related papers
- Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval [2.2613695007273926]
We introduce the Remote Sensing Rich Text dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval framework called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions (see the matching sketch below).
arXiv Detail & Related papers (2025-12-11T12:43:41Z)
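The text-to-text reformulation above can be illustrated with a simple matcher. The sketch below scores a rich-text query against VLM-generated captions with TF-IDF cosine similarity; both the scorer and the captions are stand-ins, not TRSLLaVA's actual components.

```python
# Sketch of text-to-text (T2T) retrieval: match a rich-text query against
# VLM-generated captions instead of raw images. TF-IDF cosine similarity
# is a stand-in scorer; the paper's actual matcher may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

captions = [  # one VLM-generated caption per gallery image (illustrative)
    "an airport with two runways and parked aircraft",
    "dense residential blocks along a river",
    "circular irrigation fields in arid farmland",
]
query = "satellite view of an airfield with planes on the apron"

vectorizer = TfidfVectorizer().fit(captions + [query])
scores = cosine_similarity(
    vectorizer.transform([query]), vectorizer.transform(captions)
)[0]
best = scores.argmax()
print(f"best match: image {best} (score {scores[best]:.3f})")
```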
- EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models [41.63451923844824]
We propose a prompt inversion technique called EDITOR for text-to-image diffusion models. Our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability.
arXiv Detail & Related papers (2025-06-03T16:44:15Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [74.22413337117617]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases (see the text-as-image sketch below).
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
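A text-as-image paradigm like TOD's can be approximated with any shared image-text embedding space. The sketch below uses the Hugging Face transformers CLIP API to produce normalized text features that can stand in for image features; the model choice and the substitution step are illustrative assumptions, and TOD's debiasing objective is not reproduced.

```python
# Sketch of a text-as-image paradigm: CLIP maps text and images into a
# shared space, so normalized text embeddings can substitute for image
# embeddings when no images are available. Illustration only; TOD's
# actual debiasing objective is not shown.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

descriptions = ["a photo of a doctor", "a photo of a nurse"]
inputs = processor(text=descriptions, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feats = model.get_text_features(**inputs)
# Normalize so the features live where image features would.
pseudo_image_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
print(pseudo_image_feats.shape)  # (2, 512) for the base model
```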
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
We propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning samples tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length, as sketched below.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
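Adaptive allocation of visual sequence length, as in Leopard's encoding module, can be caricatured by splitting a fixed token budget across images. The proportional-to-area rule, the budget, and the per-image floor below are my assumptions, not the paper's algorithm.

```python
# Toy sketch of adaptive visual-token allocation across multiple images:
# split a fixed token budget proportionally to image area, with a floor so
# every image keeps some tokens. The proportional rule is an assumption,
# not Leopard's actual algorithm.
def allocate_visual_tokens(image_sizes, budget=2048, floor=64):
    areas = [w * h for (w, h) in image_sizes]
    total = sum(areas)
    alloc = [max(floor, int(budget * a / total)) for a in areas]
    # Trim any overshoot from the largest allocation so the budget holds.
    while sum(alloc) > budget:
        alloc[alloc.index(max(alloc))] -= 1
    return alloc

print(allocate_visual_tokens([(1024, 768), (640, 480), (320, 240)]))
```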
- Zero-shot Text-guided Infinite Image Synthesis with LLM guidance [2.531998650341267]
There is a lack of text-image paired datasets with high-resolution and contextual diversity. Expanding images based on text requires global coherence and rich local context understanding. We propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding.
arXiv Detail & Related papers (2024-07-17T15:10:01Z)
- Image Captioning with Multi-Context Synthetic Data [16.961112970612447]
Large models have excelled in producing high-quality images and text.
We present an innovative pipeline that introduces multi-context data generation.
Our model is exclusively trained on synthetic image-text pairs crafted through this process.
arXiv Detail & Related papers (2023-05-29T13:18:59Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features (a toy quantization sketch follows this entry).
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
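The quantization at the heart of LQAE can be sketched as a nearest-neighbor lookup: continuous image-patch features are snapped to rows of a frozen text-token embedding table, turning an image into a token sequence. The random tensors below stand in for a real embedding table and encoder; the full autoencoder and its losses are omitted.

```python
# Sketch of language-quantized encoding: snap continuous image-patch
# features to their nearest rows in a frozen text-token embedding table,
# so an image becomes a sequence of text tokens. The encoder/decoder and
# training losses of LQAE are omitted.
import torch

vocab_size, dim, num_patches = 30522, 768, 16
token_embeddings = torch.randn(vocab_size, dim)   # stand-in for BERT's table
patch_features = torch.randn(num_patches, dim)    # stand-in for encoder output

# Nearest neighbor under Euclidean distance, patch by patch.
dists = torch.cdist(patch_features, token_embeddings)  # (16, 30522)
token_ids = dists.argmin(dim=1)                        # one token per patch
quantized = token_embeddings[token_ids]                # quantized features
print(token_ids.shape, quantized.shape)
```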
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme (a minimal contrastive-loss sketch follows this entry).
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
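The dual-encoder objective described above reduces to a symmetric contrastive (InfoNCE-style) loss over the matched image-text pairs in a batch. The sketch below uses random features in place of the two encoders; the temperature value is an illustrative assumption.

```python
# Sketch of the dual-encoder contrastive objective: normalized image and
# text embeddings, pairwise similarities scaled by a temperature, and a
# symmetric cross-entropy that pulls matched pairs together. Random
# features stand in for the actual encoders.
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 512, 0.07
img = F.normalize(torch.randn(batch, dim), dim=-1)  # image encoder output
txt = F.normalize(torch.randn(batch, dim), dim=-1)  # text encoder output

logits = img @ txt.t() / temperature   # (batch, batch) similarity matrix
targets = torch.arange(batch)          # matched pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
print(loss.item())
```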