Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
- URL: http://arxiv.org/abs/2502.06434v1
- Date: Mon, 10 Feb 2025 13:11:40 GMT
- Title: Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
- Authors: Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang
- Abstract summary: We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
- Score: 60.42768987736088
- Abstract: Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic value of the image data, while also imposing additional burdens in terms of generation, storage, and application. To address these issues, we propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively, relies solely on hard labels for evaluation, and achieves state-of-the-art performance in this setup. By shifting the emphasis back to the images, our benchmark and PCA framework pave the way for more balanced and accessible techniques in dataset compression research. Our code is available at: https://github.com/ArmandXiao/Rethinking-Dataset-Compression
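A minimal PyTorch sketch of what a Prune, Combine, and Augment pipeline operating on images and hard labels alone might look like. The function names, the random pruning rule, and the same-class blending step are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torchvision.transforms as T

def prune(images, labels, keep_per_class):
    """Select a fixed per-class subset (random selection as a stand-in score)."""
    kept = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        kept.append(idx[torch.randperm(len(idx))[:keep_per_class]])
    kept = torch.cat(kept)
    return images[kept], labels[kept]

def combine(images, labels):
    """Blend pairs of same-class images, so plain hard labels remain valid."""
    xs, ys = [], []
    for c in labels.unique():
        x = images[labels == c]
        xs.append(0.5 * x + 0.5 * x.roll(1, dims=0))  # average with a shifted copy
        ys.append(torch.full((len(x),), int(c), dtype=labels.dtype))
    return torch.cat(xs), torch.cat(ys)

# Standard image augmentations re-applied each epoch (the Augment step).
augment = T.Compose([T.RandomResizedCrop(224), T.RandomHorizontalFlip()])

# Usage: compress once, then train any model with ordinary cross-entropy
# on hard labels -- no soft labels are generated, stored, or loaded.
# x, y = prune(images, labels, keep_per_class=10)
# x, y = combine(x, y)
# x_batch = augment(x)
```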
Related papers
- ODDN: Addressing Unpaired Data Challenges in Open-World Deepfake Detection on Online Social Networks [51.03118447290247]
We propose the open-world deepfake detection network (ODDN), which comprises open-world data aggregation (ODA) and compression-discard gradient correction (CGC).
ODA effectively aggregates correlations between compressed and raw samples through both fine-grained and coarse-grained analyses.
CGC incorporates a compression-discard gradient correction to further enhance performance across diverse compression methods in online social networks (OSNs).
arXiv Detail & Related papers (2024-10-24T12:32:22Z)
- Heavy Labels Out! Dataset Distillation with Label Space Lightening [69.67681224137561]
HeLlO builds efficient image-to-label projectors, with which synthetic labels can be generated online directly from synthetic images.
We demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets.
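A minimal sketch of the general idea of generating soft labels online instead of storing them: keep only a small projector head on top of a publicly available pre-trained backbone, so the head is the only extra artifact to store. The architecture and temperature here are assumptions; HeLlO's actual projector may differ.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class LabelProjector(nn.Module):
    """Maps images to soft labels at training time. Only the linear head
    needs to be stored: the backbone is a standard pre-trained model."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.backbone.fc = nn.Identity()          # expose 512-d features
        self.backbone.eval()                      # backbone stays frozen
        self.head = nn.Linear(512, num_classes)   # the stored projector
    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return torch.softmax(self.head(feats) / 4.0, dim=1)  # temperature assumed

# Storage: 512*1000 head weights versus a (num_images x 1000) soft-label
# matrix -- the gap grows with every extra synthetic image and epoch.
```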
arXiv Detail & Related papers (2024-08-15T15:08:58Z)
- DD-RobustBench: An Adversarial Robustness Benchmark for Dataset Distillation [25.754877176280708]
We introduce a comprehensive benchmark that is the most extensive to date for evaluating the adversarial robustness of distilled datasets in a unified way.
Our benchmark significantly expands upon prior efforts by incorporating the latest advancements such as TESLA and SRe2L.
We also discovered that incorporating distilled data into the training batches of the original dataset can improve robustness.
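A minimal sketch of that observation, assuming distilled tensors are already available: append a few distilled samples to every real training batch. The mixing ratio k is an arbitrary choice here.

```python
import torch

def mixed_batches(real_loader, distilled_x, distilled_y, k=8):
    """Yield real training batches with k random distilled samples appended."""
    n = len(distilled_x)
    for x, y in real_loader:
        idx = torch.randint(0, n, (k,))
        yield torch.cat([x, distilled_x[idx]]), torch.cat([y, distilled_y[idx]])
```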
arXiv Detail & Related papers (2024-03-20T06:00:53Z)
- Distributional Dataset Distillation with Subtask Decomposition [18.288856447840303]
We show that our method achieves state-of-the-art results on the TinyImageNet and ImageNet-1K datasets.
Specifically, we outperform the prior art by 6.9% on ImageNet-1K under a storage budget of 2 images per class.
arXiv Detail & Related papers (2024-03-01T21:49:34Z)
- Soft labelling for semantic segmentation: Bringing coherence to label down-sampling [1.797129499170058]
In semantic segmentation, down-sampling is commonly performed due to limited resources.
We propose a novel framework for label down-sampling via soft-labeling.
This proposal also produces reliable annotations for under-represented semantic classes.
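One standard way to realize soft-label down-sampling (not necessarily the paper's exact formulation) is to average-pool the one-hot label map, so each low-resolution pixel stores class fractions rather than a single aliased hard label:

```python
import torch
import torch.nn.functional as F

def soft_downsample(labels, num_classes, factor):
    """labels: (H, W) int tensor -> (num_classes, H//factor, W//factor) soft map."""
    onehot = F.one_hot(labels, num_classes).permute(2, 0, 1).float()  # (C, H, W)
    return F.avg_pool2d(onehot.unsqueeze(0), factor).squeeze(0)       # class fractions

# Each output pixel holds the fraction of each class inside its
# factor x factor window, so thin or under-represented classes are not
# silently erased the way nearest-neighbour down-sampling erases them.
```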
arXiv Detail & Related papers (2023-02-27T17:02:30Z)
- Dataset Distillation via Factorization [58.8114016318593]
We introduce a dataset factorization approach, termed HaBa, which is a plug-and-play strategy portable to any existing dataset distillation (DD) baseline.
HaBa explores decomposing a dataset into two components: data Hallucination networks and Bases.
Our method yields significant improvements on downstream classification tasks compared with the prior state of the art, while reducing the total number of compressed parameters by up to 65%.
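A schematic of the factorization idea, with shapes and the hallucinator architecture chosen arbitrarily for illustration: B learnable bases and H tiny networks jointly yield B x H training images.

```python
import torch
import torch.nn as nn

B, H, C, S = 16, 4, 3, 32                       # bases, hallucinators, channels, size
bases = nn.Parameter(torch.randn(B, C, S, S))   # shared learnable bases
hallucinators = nn.ModuleList(
    nn.Sequential(nn.Conv2d(C, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, C, 3, padding=1))
    for _ in range(H)
)

def synthesize():
    """Every (basis, hallucinator) pair is an image: B*H images from
    only B bases and H tiny networks' worth of parameters."""
    return torch.cat([h(bases) for h in hallucinators])
```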
arXiv Detail & Related papers (2022-10-30T08:36:19Z)
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
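A schematic of condensing in a learned latent space, under assumed dimensions and an assumed decoder: store low-dimensional learnable codes plus one shared decoder instead of raw pixels.

```python
import torch
import torch.nn as nn

num_codes, code_dim = 100, 64
codes = nn.Parameter(torch.randn(num_codes, code_dim))   # learnable latent codes
decoder = nn.Sequential(                                 # shared generative process
    nn.Linear(code_dim, 8 * 8 * 32), nn.ReLU(),
    nn.Unflatten(1, (32, 8, 8)),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),   # -> (3, 32, 32)
)

synthetic = decoder(codes)   # decode all condensed images in one pass
# Both `codes` and `decoder` are optimized against the condensation
# objective; regularity shared across images lives in the decoder.
```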
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
- DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations that comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
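Condensation benchmarks of this kind typically score a method by training a fresh model on the condensed set and then measuring accuracy on the real test set; a generic sketch of that protocol (not DC-BENCH's actual API):

```python
import torch
import torch.nn as nn

def evaluate_condensed(condensed_x, condensed_y, test_loader, model, epochs=50):
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):                      # train on condensed data only
        opt.zero_grad()
        loss_fn(model(condensed_x), condensed_y).backward()
        opt.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in test_loader:                 # accuracy on the real test set
            correct += (model(x).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total
```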
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
- A Joint Pixel and Feature Alignment Framework for Cross-dataset Palmprint Recognition [25.43285951112965]
We propose a novel Joint Pixel and Feature Alignment (JPFA) framework for cross-dataset palmprint recognition scenarios.
Two-stage alignment is applied to obtain adaptive features in the source and target datasets.
Compared with the baseline, cross-dataset identification accuracy improves by up to 28.10% and the cross-dataset verification Equal Error Rate (EER) drops by up to 4.69%.
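For reference, the EER quoted above is the operating point where the false accept and false reject rates coincide; a standard NumPy computation from match scores (not the paper's code):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER from verification scores, where higher means more likely genuine."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejects
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(frr - far))             # closest crossing point
    return (frr[i] + far[i]) / 2

# e.g. equal_error_rate(np.array([0.9, 0.8, 0.7]), np.array([0.3, 0.6, 0.2]))
```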
arXiv Detail & Related papers (2020-05-25T11:40:51Z)