Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
- URL: http://arxiv.org/abs/2502.14846v1
- Date: Thu, 20 Feb 2025 18:55:30 GMT
- Title: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
- Authors: Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
- Abstract summary: CoSyn is a framework that creates synthetic text-rich multimodal data.
It can generate high-quality instruction-tuning data.
It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
- Score: 79.71072337496351
- Abstract: Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
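The abstract describes a two-step loop: a text-only LLM writes rendering code for a target domain, that code is executed to produce an image, and the same code then stands in for the image when the LLM writes instruction-tuning question-answer pairs. Below is a minimal sketch of such a loop, assuming a Python/matplotlib rendering path and a generic `call_llm` helper; both are illustrative stand-ins, not the authors' implementation.
```python
# Minimal sketch of a CoSyn-style generation loop (illustrative only).
# `call_llm` is a placeholder for whatever text-only LLM client is available;
# it is not the authors' API.
import json
import subprocess
import tempfile
from pathlib import Path


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a text-only LLM and return its reply."""
    raise NotImplementedError("Wire this up to an LLM provider.")


def generate_render_code(domain: str) -> str:
    """Ask the LLM for a self-contained script that draws one synthetic image."""
    prompt = (
        f"Write a self-contained Python script using matplotlib that renders a "
        f"synthetic '{domain}' image and saves it to 'output.png'. Return only code."
    )
    return call_llm(prompt)


def render(code: str, work_dir: Path) -> Path:
    """Execute the generated script in a scratch directory; return the image path."""
    script = work_dir / "render.py"
    script.write_text(code)
    subprocess.run(["python", str(script)], cwd=work_dir, check=True, timeout=120)
    return work_dir / "output.png"


def generate_instructions(code: str, n_pairs: int = 5) -> list[dict]:
    """Use the rendering code as a textual stand-in for the image to get Q/A pairs."""
    prompt = (
        "The following code renders an image. Without seeing the image, write "
        f"{n_pairs} question-answer pairs about its visible content, as a JSON "
        'list of {"question": ..., "answer": ...} objects.\n\n' + code
    )
    return json.loads(call_llm(prompt))


def synthesize_example(domain: str) -> dict:
    """One synthetic training example: image bytes plus instruction-tuning Q/A."""
    code = generate_render_code(domain)
    with tempfile.TemporaryDirectory() as tmp:
        image_bytes = render(code, Path(tmp)).read_bytes()
    return {"domain": domain, "code": code, "image": image_bytes,
            "qa_pairs": generate_instructions(code)}
```
Because the Q&A prompt sees only the code, the questions are grounded by construction: anything they reference is present in the rendered image, which is the property the abstract attributes to using code as the textual representation.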
Related papers
- RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
A substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.
We establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales.
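The summary names a "hierarchical retrieval method" without detailing it; as a loose illustration of the basic image-to-text association step such a pipeline rests on, here is a plain top-k cosine-similarity retrieval over precomputed embeddings. The dual encoder, the hierarchy, and all names here are assumptions, not RealSyn's actual method.
```python
# Illustrative top-k image-to-text association over precomputed embeddings.
# The hierarchy and the embedding model are not specified in the summary,
# so this shows only the basic retrieval step such a pipeline builds on.
import numpy as np


def top_k_texts(image_embs: np.ndarray, text_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar texts for each image.

    image_embs: (n_images, d) array, one row per image.
    text_embs:  (n_texts, d) array, one row per candidate text.
    """
    # L2-normalize so the dot product equals cosine similarity.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = img @ txt.T                        # (n_images, n_texts)
    return np.argsort(-sims, axis=1)[:, :k]   # top-k columns, highest first


# Toy usage with random vectors standing in for a real dual encoder.
rng = np.random.default_rng(0)
matches = top_k_texts(rng.normal(size=(3, 32)), rng.normal(size=(10, 32)), k=4)
print(matches.shape)  # (3, 4): four associated texts per image
```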
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models [39.21242589835842]
We introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs.
To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs.
In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets.
arXiv Detail & Related papers (2024-07-30T11:57:40Z)
- ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis [6.066100464517522]
We introduce the Abstractive News Captions with High-level cOntext Representation (ANCHOR) dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
arXiv Detail & Related papers (2024-04-15T21:19:10Z)
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective Visual-Language Models (VLMs) training.
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves performance comparable to that of models trained solely on human-annotated data.
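As a rough illustration of the data-generation pattern described above, the sketch below pairs LLM-generated captions with image embeddings produced directly by a text-to-image model, skipping pixel rendering; `sample_caption` and `caption_to_image_embedding` are hypothetical stand-ins, not Synth$^2$'s components.
```python
# Illustrative only: `sample_caption` and `caption_to_image_embedding` are
# hypothetical stand-ins for an LLM and a pretrained text-to-image model's
# embedding output; neither reflects Synth^2's actual components.
from typing import Callable, Iterator, Tuple

import numpy as np


def synthetic_pairs(
    sample_caption: Callable[[], str],
    caption_to_image_embedding: Callable[[str], np.ndarray],
    n: int,
) -> Iterator[Tuple[np.ndarray, str]]:
    """Yield (image_embedding, caption) pairs for VLM training.

    Staying in embedding space avoids rendering pixels, which is the
    efficiency argument the summary alludes to.
    """
    for _ in range(n):
        caption = sample_caption()
        yield caption_to_image_embedding(caption), caption


# Toy usage with dummy callables in place of the real models.
for emb, cap in synthetic_pairs(lambda: "a red bicycle leaning on a fence",
                                lambda c: np.zeros(512, dtype=np.float32), n=2):
    print(emb.shape, cap)
```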
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
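The last sentence describes a multi-positive contrastive objective in which images generated from the same synthetic caption are positives for one another. Below is a small numpy sketch of that idea in a SupCon-style form; the names are mine, and SynCLR's exact loss may differ.
```python
# Sketch of a multi-positive contrastive loss in which images generated from
# the same synthetic caption are positives (SupCon-style; SynCLR's exact
# objective may differ).
import numpy as np


def caption_contrastive_loss(embs: np.ndarray, caption_ids: np.ndarray,
                             temperature: float = 0.1) -> float:
    """embs: (n, d) L2-normalized image embeddings; caption_ids: (n,) ints."""
    n = embs.shape[0]
    sims = embs @ embs.T / temperature                       # (n, n) logits
    self_mask = np.eye(n, dtype=bool)
    pos_mask = (caption_ids[:, None] == caption_ids[None, :]) & ~self_mask

    # Log-softmax over all non-self candidates for each anchor.
    logits = np.where(self_mask, -np.inf, sims)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average log-probability of each anchor's positives, then over anchors.
    per_anchor = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor /= np.maximum(pos_mask.sum(axis=1), 1)
    return float(-per_anchor.mean())


# Toy usage: six embeddings, two images per synthetic caption.
rng = np.random.default_rng(0)
z = rng.normal(size=(6, 16))
z /= np.linalg.norm(z, axis=1, keepdims=True)
print(caption_contrastive_loss(z, np.array([0, 0, 1, 1, 2, 2])))
```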
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis [104.53930611219654]
We present a large-scale synthetic dataset for novel view synthesis consisting of 300k images rendered from nearly 2000 complex scenes.
The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis.
Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures.
arXiv Detail & Related papers (2022-05-14T13:15:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.