Towards Trustworthy Dataset Distillation
- URL: http://arxiv.org/abs/2307.09165v2
- Date: Sun, 11 Aug 2024 07:35:40 GMT
- Title: Towards Trustworthy Dataset Distillation
- Authors: Shijie Ma, Fei Zhu, Zhen Cheng, Xu-Yao Zhang,
- Abstract summary: Dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset.
We propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD).
By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection.
- Score: 26.361077372859498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficiency and trustworthiness are two eternal pursuits when applying deep learning in real-world applications. With regard to efficiency, dataset distillation (DD) endeavors to reduce training costs by distilling the large dataset into a tiny synthetic dataset. However, existing methods merely concentrate on in-distribution (InD) classification in a closed-world setting, disregarding out-of-distribution (OOD) samples. On the other hand, OOD detection aims to enhance models' trustworthiness, which is always inefficiently achieved in full-data settings. For the first time, we simultaneously consider both issues and propose a novel paradigm called Trustworthy Dataset Distillation (TrustDD). By distilling both InD samples and outliers, the condensed datasets are capable of training models competent in both InD classification and OOD detection. To alleviate the requirement of real outlier data, we further propose to corrupt InD samples to generate pseudo-outliers, namely Pseudo-Outlier Exposure (POE). Comprehensive experiments on various settings demonstrate the effectiveness of TrustDD, and POE surpasses the state-of-the-art method Outlier Exposure (OE). Compared with the preceding DD, TrustDD is more trustworthy and applicable to open-world scenarios. Our code is available at https://github.com/mashijie1028/TrustDD
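The POE idea in the abstract, corrupting InD samples to obtain pseudo-outliers, can be sketched with a simple corruption. The snippet below uses patch shuffling as one illustrative corruption; the function name, grid size, and the choice of corruption are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def make_pseudo_outliers(images, rng, grid=4):
    """Corrupt InD images into pseudo-outliers by shuffling image patches.

    Patch shuffling is one illustrative corruption (assumed here; the
    paper may use others). `images` is (N, H, W, C) with H and W
    divisible by `grid`.
    """
    n, h, w, c = images.shape
    ph, pw = h // grid, w // grid
    out = np.empty_like(images)
    for i, img in enumerate(images):
        # cut each image into grid*grid non-overlapping patches
        patches = [img[r * ph:(r + 1) * ph, s * pw:(s + 1) * pw]
                   for r in range(grid) for s in range(grid)]
        # reassemble the patches in a random order
        for k, j in enumerate(rng.permutation(len(patches))):
            r, s = divmod(k, grid)
            out[i, r * ph:(r + 1) * ph, s * pw:(s + 1) * pw] = patches[j]
    return out

rng = np.random.default_rng(0)
ind_batch = rng.random((2, 32, 32, 3))
pseudo = make_pseudo_outliers(ind_batch, rng)
```

A model fine-tuned on such pseudo-outliers can then be trained to emit low-confidence predictions on them, removing the need for a real auxiliary outlier dataset.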
Related papers
- DistDD: Distributed Data Distillation Aggregation through Gradient Matching [14.132062317010847]
DistDD is a novel approach within the federated learning framework that reduces the need for repetitive communication by distilling data directly on clients' devices.
We provide a detailed convergence proof of the DistDD algorithm, reinforcing its mathematical stability and reliability for practical applications.
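Gradient matching, the mechanism DistDD distills with, optimizes synthetic data so that it induces the same parameter gradients as real data. A minimal sketch with binary logistic regression (model choice, function names, and the cosine-distance objective are assumptions for illustration):

```python
import numpy as np

def logreg_grad(w, X, y):
    """Gradient of the binary logistic loss with respect to weights w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def grad_match_loss(w, X_real, y_real, X_syn, y_syn):
    """Gradient-matching objective: cosine distance between the loss
    gradients induced by the real batch and the synthetic batch."""
    g_r = logreg_grad(w, X_real, y_real)
    g_s = logreg_grad(w, X_syn, y_syn)
    cos = g_r @ g_s / (np.linalg.norm(g_r) * np.linalg.norm(g_s) + 1e-12)
    return 1.0 - cos
```

In practice the synthetic set `X_syn` is optimized to minimize this loss across many network initializations; identical real and synthetic batches drive the loss to zero.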
arXiv Detail & Related papers (2024-10-11T09:43:35Z)
- OAL: Enhancing OOD Detection Using Latent Diffusion [5.357756138014614]
The Outlier Aware Learning (OAL) framework synthesizes OOD training data directly in the latent space.
We introduce a mutual information-based contrastive learning approach that amplifies the distinction between In-Distribution (ID) and collected OOD features.
arXiv Detail & Related papers (2024-06-24T11:01:43Z)
- Towards Adversarially Robust Dataset Distillation by Curvature Regularization [11.02948004359488]
We study how to embed adversarial robustness in distilled datasets, so that models trained on these datasets maintain the high accuracy and acquire better adversarial robustness.
We propose a new method that achieves this goal by incorporating curvature regularization into the distillation process with much less computational overhead than standard adversarial training.
arXiv Detail & Related papers (2024-03-15T06:31:03Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- MIM4DD: Mutual Information Maximization for Dataset Distillation [15.847690902246727]
We introduce mutual information (MI) as the metric to quantify the shared information between the synthetic and the real datasets.
We devise MIM4DD to numerically maximize the MI via a newly designed optimizable objective within a contrastive learning framework.
Experiment results show that MIM4DD can be implemented as an add-on module to existing SoTA DD methods.
arXiv Detail & Related papers (2023-12-27T16:22:50Z)
- Diversified Outlier Exposure for Out-of-Distribution Detection via Informative Extrapolation [110.34982764201689]
Out-of-distribution (OOD) detection is important for deploying reliable machine learning models on real-world applications.
Recent advances in outlier exposure have shown promising results on OOD detection via fine-tuning the model with informatively sampled auxiliary outliers.
We propose a novel framework, namely, Diversified Outlier Exposure (DivOE), for effective OOD detection via informative extrapolation based on the given auxiliary outliers.
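Outlier-exposure methods of this family commonly add a term that pushes the model's predictions on auxiliary outliers toward the uniform distribution. A minimal sketch of that surrogate objective (function names and the weighting `lam` are assumptions, not the exact formulation of OE or DivOE):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def oe_objective(ind_logits, ind_labels, out_logits, lam=0.5):
    """InD cross-entropy plus a term pushing predictions on auxiliary
    outliers toward the uniform distribution over the k classes."""
    p_ind = softmax(ind_logits)
    ce = -np.mean(np.log(p_ind[np.arange(len(ind_labels)), ind_labels]))
    k = out_logits.shape[1]
    # cross-entropy between uniform(1/k) and the outlier predictions
    unif_ce = -np.mean(np.log(softmax(out_logits)).sum(axis=1) / k)
    return ce + lam * unif_ce
```

Minimizing the second term makes the model maximally uncertain on outliers, so a confidence-based score can separate InD from OOD inputs at test time.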
arXiv Detail & Related papers (2023-10-21T07:16:09Z)
- Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
- Dataset Distillation: A Comprehensive Review [76.26276286545284]
Dataset distillation (DD) aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset.
This paper gives a comprehensive review and summary of recent advances in DD and its application.
arXiv Detail & Related papers (2023-01-17T17:03:28Z)
- Raising the Bar on the Evaluation of Out-of-Distribution Detection [88.70479625837152]
We define two categories of OoD data using the subtly different concepts of perceptual/visual and semantic similarity to in-distribution (iD) data.
We propose a GAN-based framework for generating OoD samples from each of these two categories, given an iD dataset.
We show that state-of-the-art OoD detection methods, which perform exceedingly well on conventional benchmarks, are significantly less robust on our proposed benchmark.
arXiv Detail & Related papers (2022-09-24T08:48:36Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples whose confidence exceeds the threshold.
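The ATC idea can be sketched in a few lines: fit the threshold on labeled source data so that the fraction of source points above it matches source accuracy, then report the above-threshold fraction on the unlabeled target set. This simplified sketch uses a max-confidence score (function names are assumptions; the paper also considers entropy-based scores):

```python
import numpy as np

def atc_threshold(src_conf, src_correct):
    """Pick t so that the fraction of source points with confidence > t
    equals the source accuracy."""
    acc = src_correct.mean()
    # threshold at the (1 - acc) quantile of source confidences
    return np.quantile(src_conf, 1.0 - acc)

def atc_predict(tgt_conf, t):
    """Predicted target accuracy: fraction of unlabeled target examples
    whose confidence exceeds the learned threshold."""
    return (tgt_conf > t).mean()
```

By construction, when the target distribution matches the source distribution the prediction recovers the source accuracy; the interesting empirical question is how well it tracks accuracy under distribution shift.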
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.