SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models
- URL: http://arxiv.org/abs/2407.20756v3
- Date: Sat, 10 Aug 2024 15:06:12 GMT
- Title: SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models
- Authors: Zheng Liu, Hao Liang, Xijie Huang, Wentao Xiong, Qinhan Yu, Linzhuang Sun, Chong Chen, Conghui He, Bin Cui, Wentao Zhang
- Abstract summary: We introduce SynthVLM, a novel data synthesis pipeline for Vision Large Language Models (VLLMs).
Unlike existing methods that generate captions from images, SynthVLM employs advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions.
We achieve state-of-the-art (SoTA) performance on various vision question answering tasks, maintaining high alignment quality and preserving advanced language abilities.
- Score: 39.55942000935765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, with the rise of web images, managing and understanding large-scale image datasets has become increasingly important. Vision Large Language Models (VLLMs) have recently emerged due to their robust vision-understanding capabilities. However, training these models requires vast amounts of data, posing challenges to efficiency, effectiveness, data quality, and privacy. In this paper, we introduce SynthVLM, a novel data synthesis pipeline for VLLMs. Unlike existing methods that generate captions from images, SynthVLM employs advanced diffusion models and high-quality captions to automatically generate and select high-resolution images from captions, creating precisely aligned image-text pairs. Leveraging these pairs, we achieve state-of-the-art (SoTA) performance on various vision question answering tasks, maintaining high alignment quality and preserving advanced language abilities. Moreover, SynthVLM surpasses traditional GPT-4 Vision-based caption generation methods in performance while significantly reducing computational overhead. Crucially, our method's reliance on purely generated data ensures the preservation of privacy, achieving SoTA performance with just 100k data points (only 18% of the official dataset size).
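To make the generate-and-select idea concrete, below is a minimal sketch of such a pipeline: a diffusion model renders an image for each caption, and a CLIP similarity score filters for well-aligned pairs. The specific checkpoints (Stable Diffusion 2.1, CLIP ViT-L/14), the use of CLIP as the alignment scorer, and the 0.30 threshold are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a caption-to-image generate-and-select loop in the spirit
# of SynthVLM. Model checkpoints, the CLIP-based alignment score, and the 0.30
# threshold are illustrative assumptions, not the paper's exact configuration.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Text-to-image generator (assumed backbone).
generator = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=dtype
).to(device)

# CLIP used here as a stand-in image-text alignment scorer.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def alignment_score(image, caption):
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = clip_proc(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    with torch.no_grad():
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img * txt).sum().item()

def synthesize_pairs(captions, threshold=0.30):
    """Generate one image per caption and keep only well-aligned pairs."""
    kept = []
    for caption in captions:
        image = generator(caption, num_inference_steps=30).images[0]
        if alignment_score(image, caption) >= threshold:  # selection step (assumed)
            kept.append((image, caption))
    return kept
```

In practice, generating several candidates per caption and keeping the top-scoring one (rather than applying a fixed threshold) would trade generation cost against alignment quality; the abstract reports reaching SoTA with roughly 100k selected pairs.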
Related papers
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [79.71072337496351]
CoSyn is a framework that creates synthetic text-rich multimodal data.
It can generate high-quality instruction-tuning data.
It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z)
- RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
A substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.
We establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales.
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
- mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data [71.352883755806]
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space.
However, the limited labeled multimodal data often hinders embedding performance.
Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck.
arXiv Detail & Related papers (2025-02-12T15:03:33Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models.
Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81 and 17.11 absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective Visual-Language Models (VLMs) training.
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data.
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
- Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions.
We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption.
We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs (see the sketch after this list).
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
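For the SynCLR entry above, the multi-positive contrastive objective can be sketched as follows. Treating images generated from the same synthetic caption as positives comes from the summary; the SupCon-style formulation, the temperature value, and the averaging scheme are assumptions made for illustration, not necessarily SynCLR's exact loss.

```python
# A minimal sketch of multi-positive contrastive learning over synthetic images,
# where images generated from the same caption are treated as positives.
# The SupCon-style formulation and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """embeddings: (N, D) image features; caption_ids: (N,) source-caption index."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N) pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask

    # Softmax over all non-self pairs; average the log-probability of positives.
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    loss = -pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

# Example: a batch of 8 synthetic images drawn from 4 captions (2 images each).
features = torch.randn(8, 256)
caption_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(multi_positive_contrastive_loss(features, caption_ids))
```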