Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment
- URL: http://arxiv.org/abs/2409.17612v2
- Date: Tue, 5 Nov 2024 06:47:37 GMT
- Title: Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment
- Authors: Jiawei Du, Xin Zhang, Juncheng Hu, Wenxin Huang, Joey Tianyi Zhou,
- Abstract summary: We argue that enhancing diversity can improve the parallelizable yet isolated approach to synthesizing datasets.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
- Score: 39.137060714048175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sharp increase in data-related expenses has motivated research into condensing datasets while retaining the most informative features. Dataset distillation has thus recently come to the fore. This paradigm generates synthetic datasets that are representative enough to replace the original dataset in training a neural network. To avoid redundancy in these synthetic datasets, it is crucial that each element contains unique features and remains diverse from others during the synthesis stage. In this paper, we provide a thorough theoretical and empirical analysis of diversity within synthesized datasets. We argue that enhancing diversity can improve the parallelizable yet isolated synthesizing approach. Specifically, we introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process, thereby maximizing the representativeness and diversity of each synthetic instance. Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset. Extensive experiments across multiple datasets, including CIFAR, Tiny-ImageNet, and ImageNet-1K, demonstrate the superior performance of our method, highlighting its effectiveness in producing diverse and representative synthetic datasets with minimal computational expense. Our code is available at https://github.com/AngusDujw/Diversity-Driven-Synthesis.https://github.com/AngusDujw/Diversity-Drive n-Synthesis.
Related papers
- SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models [9.340077455871736]
Long-tailed distributions in image recognition pose a considerable challenge due to the severe imbalance between a few dominant classes.
Recently, the use of large generative models to create synthetic data for image classification has been realized.
We propose the use of synthetic data as a complement to long-tailed datasets to eliminate the impact of data imbalance.
arXiv Detail & Related papers (2024-08-29T05:33:59Z) - Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance [16.047084318753377]
Imbalanced data and spurious correlations are common challenges in machine learning and data science.
Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges.
We introduce OPAL, a systematic oversampling approach that leverages the capabilities of large language models to generate high-quality synthetic data for minority groups.
arXiv Detail & Related papers (2024-06-05T21:24:26Z) - SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z) - A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets [4.389150156866014]
Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning.
These benefits have clear empirical support, but the theoretical understanding of them is currently very light.
We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets.
arXiv Detail & Related papers (2024-02-06T13:20:46Z) - Sequential Subset Matching for Dataset Distillation [44.322842898670565]
We propose a new dataset distillation strategy called Sequential Subset Matching (SeqMatch)
Our analysis indicates that SeqMatch effectively addresses the coupling issue by sequentially generating the synthetic instances.
Our code is available at https://github.com/shqii1j/seqmatch.
arXiv Detail & Related papers (2023-11-02T19:49:11Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Data Distillation Can Be Like Vodka: Distilling More Times For Better
Quality [78.6359306550245]
We argue that using just one synthetic subset for distillation will not yield optimal generalization performance.
PDD synthesizes multiple small sets of synthetic images, each conditioned on the previous sets, and trains the model on the cumulative union of these subsets.
Our experiments show that PDD can effectively improve the performance of existing dataset distillation methods by up to 4.3%.
arXiv Detail & Related papers (2023-10-10T20:04:44Z) - Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching [19.8751746334929]
We present an algorithm that remains effective as the size of the synthetic dataset grows.
We experimentally find that the training stage of the trajectories we choose to match greatly affects the effectiveness of the distilled dataset.
In doing so, we successfully scale trajectory matching-based methods to larger synthetic datasets.
arXiv Detail & Related papers (2023-10-09T14:57:41Z) - Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.