Related papers: Efficient Multimodal Dataset Distillation via Generative Models

Efficient Multimodal Dataset Distillation via Generative Models

URL: http://arxiv.org/abs/2509.15472v2
Date: Thu, 25 Sep 2025 22:23:18 GMT
Title: Efficient Multimodal Dataset Distillation via Generative Models
Authors: Zhenghao Zhao, Haoxuan Wang, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan,
Abstract summary: We introduce EDGE, a generative distillation method for efficient multimodal dataset distillation.<n>Specifically, we identify two key challenges of distilling multimodal datasets with generative models.<n>We propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss.<n>Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches.
Score: 37.60051495186203
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.

Related papers

Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis [8.74674837306488]
We propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization.<n>Our method uses CLIP to extract aligned image-text embeddings, obtains prototypes, and employs an unCLIP decoder to synthesize images.
arXiv Detail & Related papers (2026-02-23T12:08:28Z)
Diversity-Driven Generative Dataset Distillation Based on Diffusion Model with Self-Adaptive Memory [33.38900857290244]
We present a diversity-driven generative dataset distillation method based on a diffusion model to solve this problem.<n>We introduce self-adaptive memory to align the distribution between distilled and real datasets, assessing the representativeness.<n>Our method outperforms existing state-of-the-art methods in most situations.
arXiv Detail & Related papers (2025-05-26T03:48:56Z)
Dataset Distillation via Committee Voting [21.018818924580877]
We introduce $bf C$ommittee $bf V$oting for $bf D$ataset $bf D$istillation (CV-DD)<n>CV-DD is a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.
arXiv Detail & Related papers (2025-01-13T18:59:48Z)
Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization [34.53986517177061]
We propose a novel framework to existing diffusion-based distillation methods, leveraging diffusion models for selection rather than generation.<n>Our method starts by predicting noise generated by the diffusion model based on input images and text prompts, then calculates the corresponding loss for each pair.<n>This streamlined framework enables a single-step distillation process, and extensive experiments demonstrate that our approach outperforms state-of-the-art methods across various metrics.
arXiv Detail & Related papers (2024-12-13T08:34:46Z)
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.<n>Our key idea involves challenging the model to discern both matching and distinct elements.<n>We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
Low-Rank Similarity Mining for Multimodal Dataset Distillation [50.45577048854653]
We propose Low-Rank Similarity Mining (LoRS) for multimodal dataset distillation. LoRS distills a ground truth similarity matrix with image-text pairs, and leverages low-rank factorization for efficiency and scalability.
arXiv Detail & Related papers (2024-06-06T07:05:20Z)
One Category One Prompt: Dataset Distillation using Diffusion Models [22.512552596310176]
We introduce Diffusion Models (D3M) as a novel paradigm for dataset distillation, leveraging recent advancements in generative text-to-image foundation models. Our approach utilizes textual inversion, a technique for fine-tuning text-to-image generative models, to create concise and informative representations for large datasets.
arXiv Detail & Related papers (2024-03-11T20:23:59Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
Vision-Language Dataset Distillation [26.886260846439612]
We design the first vision-language dataset distillation method, building on the idea of trajectory matching. A key challenge is that vision-language datasets do not have a set of discrete classes. Our proposed method jointly distills image-text pairs in a contrastive formulation.
arXiv Detail & Related papers (2023-08-15T03:22:40Z)
Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (textscMTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis. For generative planning, we find textscMTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task. By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset. We find the most contributing samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z)
Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images. The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data. We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.