Related papers: SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training?

Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [79.71072337496351]
CoSyn is a framework that creates synthetic text-rich multimodal data. It can generate high-quality instruction-tuning data. It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z)
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a variety of benchmarks. A substantial volume of multimodal interleaved documents remains underutilized for contrastive vision-language representation learning. We establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M.
arXiv Detail & Related papers (2025-02-18T03:58:38Z)
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution. We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning [23.63386159778117]
We design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust learning. CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies. We show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
arXiv Detail & Related papers (2024-10-15T18:06:41Z)
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models [39.21242589835842]
We introduce SynthVLM, a novel data synthesis and curation method for generating image-caption pairs. To demonstrate the power of SynthVLM, we introduce SynthVLM-100K, a high-quality dataset consisting of 100,000 curated and synthesized image-caption pairs. In both model and human evaluations, SynthVLM-100K outperforms traditional real-world datasets.
arXiv Detail & Related papers (2024-07-30T11:57:40Z)
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better [39.57368843211441]
Every synthetic image ultimately originates from the upstream data used to train the generator. We compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion against finetuning on targeted real images retrieved directly from LAION-2B. Our analysis suggests that the underperformance of synthetic data is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images.
arXiv Detail & Related papers (2024-06-07T18:04:21Z)
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
Learning Vision from Models Rivals Learning Vision from Data [54.43596959598465]
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs.
arXiv Detail & Related papers (2023-12-28T18:59:55Z)
Image Captions are Natural Prompts for Text-to-Image Models [70.30915140413383]
We analyze the relationship between the training effect of synthetic data and the synthetic data distribution induced by prompts. We propose a simple yet effective method that prompts text-to-image generative models to synthesize more informative and diverse training data. Our method significantly improves the performance of models trained on synthetic training data.
arXiv Detail & Related papers (2023-07-17T14:38:11Z)
Is synthetic data from generative models ready for image recognition? [69.42645602062024]
We study whether and how synthetic images generated from state-of-the-art text-to-image generation models can be used for image recognition tasks. We showcase the powerfulness and shortcomings of synthetic data from existing generative models, and propose strategies for better applying synthetic data for recognition tasks.
arXiv Detail & Related papers (2022-10-14T06:54:24Z)
SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models [9.934446907923725]
We introduce a new synthetic text image generator, SynthTIGER, by analyzing techniques used for text image synthesis and integrating effective ones under a single algorithm. In our experiments, SynthTIGER achieves better STR performance than the combination of synthetic datasets.
arXiv Detail & Related papers (2021-07-20T08:03:45Z)
Synthesize-It-Classifier: Learning a Generative Classifier through Recurrent Self-analysis [9.029985847202667]
We show the generative capability of an image classifier network by synthesizing high-resolution, photo-realistic, and diverse images at scale. The overall methodology, called Synthesize-It-Classifier (STIC), does not require an explicit generator network to estimate the density of the data distribution. We demonstrate an Attentive-STIC network that shows an iterative drawing of synthesized images on the ImageNet dataset.
arXiv Detail & Related papers (2021-03-26T02:00:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.