CAFE: Learning to Condense Dataset by Aligning Features
- URL: http://arxiv.org/abs/2203.01531v1
- Date: Thu, 3 Mar 2022 05:58:49 GMT
- Title: CAFE: Learning to Condense Dataset by Aligning Features
- Authors: Kai Wang, Bo Zhao, Xiangyu Peng, Zheng Zhu, Shuo Yang, Shuo Wang, Guan
Huang, Hakan Bilen, Xinchao Wang, and Yang You
- Abstract summary: We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
- Score: 72.99394941348757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset condensation aims at reducing the network training effort through
condensing a cumbersome training set into a compact synthetic one.
State-of-the-art approaches largely rely on learning the synthetic data by
matching the gradients between the real and synthetic data batches. Despite the
intuitive motivation and promising results, such gradient-based methods, by
nature, easily overfit to a biased set of samples that produce dominant
gradients, and thus lack global supervision of data distribution. In this
paper, we propose a novel scheme to Condense dataset by Aligning FEatures
(CAFE), which explicitly attempts to preserve the real-feature distribution as
well as the discriminant power of the resulting synthetic set, lending itself
to strong generalization capability across various architectures. At the heart of
our approach is an effective strategy to align features from the real and
synthetic data across various scales, while accounting for the classification
of real samples. Our scheme is further backed up by a novel dynamic bi-level
optimization, which adaptively adjusts parameter updates to prevent
over-/under-fitting. We validate the proposed CAFE across various datasets, and
demonstrate that it generally outperforms the state of the art: on the SVHN
dataset, for example, the performance gain is up to 11%. Extensive experiments
and analyses verify the effectiveness and necessity of the proposed designs.
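The core idea of the abstract, aligning real and synthetic features "across various scales", can be sketched as a layer-wise loss between batch-mean features. This is a minimal illustrative sketch, not the authors' implementation: the function name, the use of simple batch means, and the plain sum-of-squares distance are all assumptions made for clarity.

```python
import numpy as np

def feature_alignment_loss(real_feats, syn_feats):
    """Illustrative layer-wise feature alignment.

    real_feats, syn_feats: lists of arrays of shape (batch, dim),
    one entry per network layer (the "various scales").
    Returns the summed squared distance between the batch-mean
    feature vectors of the real and synthetic data at each layer.
    """
    loss = 0.0
    for fr, fs in zip(real_feats, syn_feats):
        # Compare the mean feature of each batch at this layer.
        loss += np.sum((fr.mean(axis=0) - fs.mean(axis=0)) ** 2)
    return loss

# Toy example: two layers; real batch of 32 samples, synthetic batch of 4.
rng = np.random.default_rng(0)
real = [rng.normal(size=(32, 8)), rng.normal(size=(32, 4))]
syn = [rng.normal(size=(4, 8)), rng.normal(size=(4, 4))]
print(feature_alignment_loss(real, syn))
```

In the actual method this loss would be computed per class and minimized with respect to the synthetic images; the sketch only shows the alignment term itself.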
Related papers
- Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis [55.65459867300319]
LLMs demonstrate remarkable capabilities in following natural language instructions, largely due to instruction-tuning on high-quality datasets.
Recent approaches incorporate feedback to improve data quality, but typically operate at the sample level, generating and applying feedback for each response individually.
We propose Reference-Level Feedback, a novel methodology that instead collects feedback based on high-quality reference samples from carefully curated seed data.
arXiv Detail & Related papers (2025-02-06T21:29:00Z)
- Enhancing Generalization via Sharpness-Aware Trajectory Matching for Dataset Condensation [37.77454972709646]
We introduce Sharpness-Aware Trajectory Matching (SATM), which enhances the generalization capability of learned synthetic datasets.
Our approach is mathematically well-supported and straightforward to implement along with controllable computational overhead.
arXiv Detail & Related papers (2025-02-03T22:30:06Z)
- Going Beyond Feature Similarity: Effective Dataset Distillation Based on Class-aware Conditional Mutual Information [43.44508080585033]
We introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset.
We minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset.
arXiv Detail & Related papers (2024-12-13T08:10:47Z)
- Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric.
Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets.
We show that our metric is an effective way to rank synthetic images based on their usability.
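The UCB-based training procedure described above can be illustrated with a generic UCB1 selection rule over synthetic-data pools. This is a hedged sketch of the bandit mechanism only: the 0/1 "usability" reward, the pool setup, and all names below are stand-ins, not the paper's metric or code.

```python
import math
import random

def ucb1_select(counts, rewards, t, c=2.0):
    """Pick the arm (synthetic-data pool) with the highest UCB1 score.

    counts[i]: times pool i was sampled; rewards[i]: summed usability
    scores observed for pool i; t: current round (1-indexed).
    """
    scores = []
    for n, r in zip(counts, rewards):
        if n == 0:
            return counts.index(0)  # sample each pool at least once
        # Empirical mean plus an exploration bonus that shrinks
        # as a pool is sampled more often.
        scores.append(r / n + c * math.sqrt(math.log(t) / n))
    return scores.index(max(scores))

# Toy loop: pool 1 yields higher "usability" on average,
# so UCB concentrates its samples there over time.
random.seed(0)
true_usability = [0.3, 0.7]
counts, rewards = [0, 0], [0.0, 0.0]
for t in range(1, 201):
    arm = ucb1_select(counts, rewards, t)
    reward = 1.0 if random.random() < true_usability[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += reward
print(counts)
```

The exploration constant `c` trades off trying under-sampled pools against exploiting the pool with the best observed usability so far.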
arXiv Detail & Related papers (2024-12-06T23:36:36Z)
- Dataset Distillation for Histopathology Image Classification [46.04496989951066]
We introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD).
We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks.
arXiv Detail & Related papers (2024-08-19T05:53:38Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Implicitly Guided Design with PropEn: Match your Data to Follow the Gradient [52.2669490431145]
PropEn is inspired by 'matching', which enables implicit guidance without training a discriminator.
We show that training with a matched dataset approximates the gradient of the property of interest while remaining within the data distribution.
arXiv Detail & Related papers (2024-05-28T11:30:19Z) - Consistency Regularization for Generalizable Source-free Domain
Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods only assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z) - Bridging the Gap: Enhancing the Utility of Synthetic Data via
Post-Processing Techniques [7.967995669387532]
Generative models have emerged as a promising solution for generating synthetic datasets that can replace or augment real-world data.
We propose three novel post-processing techniques to improve the quality and diversity of the synthetic dataset.
Experiments show that Gap Filler (GaFi) effectively reduces the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively.
arXiv Detail & Related papers (2023-05-17T10:50:38Z) - Imposing Consistency for Optical Flow Estimation [73.53204596544472]
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
arXiv Detail & Related papers (2022-04-14T22:58:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.