CAFE: Learning to Condense Dataset by Aligning Features
- URL: http://arxiv.org/abs/2203.01531v1
- Date: Thu, 3 Mar 2022 05:58:49 GMT
- Title: CAFE: Learning to Condense Dataset by Aligning Features
- Authors: Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan Huang, Hakan Bilen, Xinchao Wang, and Yang You
- Abstract summary: We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
- Score: 72.99394941348757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset condensation aims at reducing the network training effort through
condensing a cumbersome training set into a compact synthetic one.
State-of-the-art approaches largely rely on learning the synthetic data by
matching the gradients between the real and synthetic data batches. Despite the
intuitive motivation and promising results, such gradient-based methods, by
nature, easily overfit to a biased set of samples that produce dominant
gradients, and thus lack global supervision of data distribution. In this
paper, we propose a novel scheme to Condense dataset by Aligning FEatures
(CAFE), which explicitly attempts to preserve the real-feature distribution as
well as the discriminant power of the resulting synthetic set, lending itself
to strong generalization capability to various architectures. At the heart of
our approach is an effective strategy to align features from the real and
synthetic data across various scales, while accounting for the classification
of real samples. Our scheme is further backed up by a novel dynamic bi-level
optimization, which adaptively adjusts parameter updates to prevent
over-/under-fitting. We validate the proposed CAFE across various datasets, and
demonstrate that it generally outperforms the state of the art: on the SVHN
dataset, for example, the performance gain is up to 11%. Extensive experiments
and analyses verify the effectiveness and necessity of proposed designs.
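To make the core idea more concrete, the sketch below shows one way such layer-wise (multi-scale) feature alignment could look in PyTorch: mean feature statistics of a real batch and of the learnable synthetic set are matched at every intermediate layer, and the loss is back-propagated into the synthetic images. This is a minimal, hypothetical illustration rather than the authors' implementation; the toy `TinyConvNet`, the `forward_features` hook, and the plain mean-matching loss are assumptions, and CAFE additionally accounts for the classification of real samples and uses a dynamic bi-level optimization that this sketch omits.

```python
# Illustrative sketch only (not the authors' released code): layer-wise
# feature alignment between a real batch and a learnable synthetic set.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyConvNet(nn.Module):
    """Toy backbone (an assumption) that returns the feature map of every block, i.e. one tensor per 'scale'."""
    def __init__(self, channels=3, width=32):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2)),
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1), nn.ReLU(), nn.AvgPool2d(2)),
        ])

    def forward_features(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def feature_alignment_loss(model, real_x, syn_x):
    """Match mean feature statistics of real vs. synthetic data at every layer,
    pushing the synthetic set toward the real feature distribution."""
    real_feats = model.forward_features(real_x)
    syn_feats = model.forward_features(syn_x)
    loss = 0.0
    for fr, fs in zip(real_feats, syn_feats):
        # Average over the batch before comparing; detach the real statistics
        # so gradients only shape the synthetic images.
        loss = loss + F.mse_loss(fs.mean(dim=0), fr.mean(dim=0).detach())
    return loss

# Usage: the synthetic images (not the network) are the optimization target.
model = TinyConvNet()
real_x = torch.randn(64, 3, 32, 32)                      # a real batch
syn_x = torch.randn(10, 3, 32, 32, requires_grad=True)   # learnable synthetic set
loss = feature_alignment_loss(model, real_x, syn_x)
loss.backward()   # gradients flow into syn_x
```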
Related papers
- Towards Robust Out-of-Distribution Generalization: Data Augmentation and Neural Architecture Search Approaches [4.577842191730992]
We study ways toward robust OoD generalization for deep learning.
We first propose a novel and effective approach to disentangle the spurious correlation between features that are not essential for recognition.
We then study the problem of strengthening neural architecture search in OoD scenarios.
arXiv Detail & Related papers (2024-10-25T20:50:32Z) - Dataset Distillation for Histopathology Image Classification [46.04496989951066]
We introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD).
We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks.
arXiv Detail & Related papers (2024-08-19T05:53:38Z) - Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by 'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z) - FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN [1.5749416770494706]
Federated Learning (FL) provides a privacy-preserving mechanism for distributed training of machine learning models on networked devices.
We propose FLIGAN, a novel approach to address the issue of data incompleteness in FL.
Our methodology adheres to FL's privacy requirements by generating synthetic data in a federated manner without sharing the actual data in the process.
arXiv Detail & Related papers (2024-03-25T16:49:38Z) - Consistency Regularization for Generalizable Source-free Domain Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods only assess their adapted models on the target training set, neglecting data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively narrows the gap to real-data accuracy, reducing the error to 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that the weights trained on synthetic data are robust against accumulated-error perturbations thanks to the regularization towards the flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z) - Dataset Condensation via Efficient Synthetic-Data Parameterization [40.56817483607132]
Machine learning with massive amounts of data comes at the price of huge computation and storage costs for training and tuning.
Recent studies on dataset condensation attempt to reduce the dependence on such massive data by synthesizing a compact training dataset.
We propose a novel condensation framework that generates multiple synthetic data with a limited storage budget via efficient parameterization considering data regularity.
arXiv Detail & Related papers (2022-05-30T09:55:31Z) - Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.