You Only Condense Once: Two Rules for Pruning Condensed Datasets
- URL: http://arxiv.org/abs/2310.14019v1
- Date: Sat, 21 Oct 2023 14:05:58 GMT
- Title: You Only Condense Once: Two Rules for Pruning Condensed Datasets
- Authors: Yang He, Lingao Xiao, Joey Tianyi Zhou
- Abstract summary: You Only Condense Once (YOCO) produces smaller condensed datasets with two embarrassingly simple dataset pruning rules.
Experiments validate our findings on networks including ConvNet, ResNet and DenseNet.
- Score: 41.92794134275854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset condensation is a crucial tool for enhancing training efficiency by
reducing the size of the training dataset, particularly in on-device scenarios.
However, these scenarios present two significant challenges: 1) the varying
computational resources available on the devices require a dataset size
different from that of the pre-defined condensed dataset, and 2) the limited
computational resources often preclude the possibility of conducting additional
condensation processes. We introduce You Only Condense Once (YOCO) to overcome
these limitations. On top of one condensed dataset, YOCO produces smaller
condensed datasets with two embarrassingly simple dataset pruning rules: Low
LBPE Score and Balanced Construction. YOCO offers two key advantages: 1) it can
flexibly resize the dataset to fit varying computational constraints, and 2) it
eliminates the need for extra condensation processes, which can be
computationally prohibitive. Experiments validate our findings on networks
including ConvNet, ResNet and DenseNet, and datasets including CIFAR-10,
CIFAR-100 and ImageNet. For example, our YOCO surpassed various dataset
condensation and dataset pruning methods on CIFAR-10 with ten Images Per Class
(IPC), achieving 6.98-8.89% and 6.31-23.92% accuracy gains, respectively. The
code is available at: https://github.com/he-y/you-only-condense-once.
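
As a rough illustration of how the two rules from the abstract compose, the sketch below keeps the lowest-scoring samples of each class so that the pruned set stays class-balanced. It assumes a per-sample LBPE score has already been computed; the function and tensor names are illustrative, not the authors' released implementation (see the linked repository for that).

```python
# Hedged sketch: prune an already-condensed dataset down to a smaller IPC
# using the two rules named in the abstract -- keep samples with low LBPE
# scores, and keep the per-class counts balanced. The scoring step is a
# placeholder; YOCO's actual LBPE computation lives in the official repo.
import torch


def prune_condensed(images: torch.Tensor,
                    labels: torch.Tensor,
                    scores: torch.Tensor,
                    target_ipc: int):
    """Select `target_ipc` samples per class with the lowest scores.

    images:  (N, C, H, W) condensed images
    labels:  (N,) integer class labels
    scores:  (N,) per-sample LBPE scores (lower = kept first, per the
             "Low LBPE Score" rule; how the score is computed is assumed)
    """
    keep = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]  # samples of class c
        order = torch.argsort(scores[idx])             # ascending score
        keep.append(idx[order[:target_ipc]])           # balanced construction
    keep = torch.cat(keep)
    return images[keep], labels[keep]


# Usage: shrink a 10-IPC condensed CIFAR-10 set to 5 IPC without re-condensing.
# imgs, lbls = prune_condensed(condensed_imgs, condensed_lbls, lbpe, target_ipc=5)
```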
Related papers
- Elucidating the Design Space of Dataset Condensation [23.545641118984115]
A concept within data-centric learning, dataset condensation efficiently transfers critical attributes from an original dataset to a synthetic version.
We propose a comprehensive design framework that includes specific, effective strategies such as soft category-aware matching.
In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%.
arXiv Detail & Related papers (2024-04-21T18:19:27Z)
- Multisize Dataset Condensation [34.14939894093381]
Multisize dataset condensation improves training efficiency in on-device scenarios.
In this paper, we propose Multisize dataset condensation (MDC) by compressing N condensation processes into a single condensation process.
Our method offers several benefits: 1) no additional condensation process is required; 2) storage requirements are reduced by reusing condensed images.
arXiv Detail & Related papers (2024-03-10T03:43:02Z)
- Dataset Condensation for Recommendation [29.239833773646975]
We propose a lightweight condensation framework tailored for recommendation (DConRec).
We model the discrete user-item interactions via a probabilistic approach and design a pre-augmentation module to incorporate the potential preferences of users into the condensed datasets.
Experimental results on multiple real-world datasets have demonstrated the effectiveness and efficiency of our framework.
arXiv Detail & Related papers (2023-10-02T09:30:11Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
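
Distribution matching, as in the entry above, is commonly implemented by matching class-wise mean features of real and synthetic batches under a randomly initialized embedding network. The sketch below shows only that generic objective; it does not reproduce the paper's specific improvements, and `embed_net` is a placeholder.

```python
# Hedged sketch of a generic distribution-matching loss for condensation:
# per class, match the mean embedding of synthetic images to that of a real
# batch under a randomly initialized feature extractor.
import torch
import torch.nn.functional as F


def distribution_matching_loss(embed_net, real_x, real_y, syn_x, syn_y):
    loss = syn_x.new_zeros(())                 # scalar accumulator on the right device
    with torch.no_grad():
        real_feat = embed_net(real_x)          # real features need no gradient
    syn_feat = embed_net(syn_x)                # gradient flows into the synthetic images
    for c in syn_y.unique():
        real_mask = (real_y == c)
        if real_mask.sum() == 0:
            continue                           # class absent from this real batch
        mu_real = real_feat[real_mask].mean(dim=0)
        mu_syn = syn_feat[syn_y == c].mean(dim=0)
        loss = loss + F.mse_loss(mu_syn, mu_real)  # match class-wise feature means
    return loss
```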
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 images per class (IPC) on a single GPU.
arXiv Detail & Related papers (2022-11-19T04:46:03Z)
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
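
A minimal sketch of the latent-code setup described above: the synthetic set is parameterized as learnable codes plus a small shared decoder, and both are optimized against a condensation objective that is left abstract here. All class counts, layer sizes, and names are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: parameterize the synthetic set as learnable latent codes plus
# a small shared decoder, so structure is shared across images instead of
# optimizing raw pixels. The condensation loss itself (gradient/distribution/
# trajectory matching) is left abstract; all names here are illustrative.
import torch
import torch.nn as nn


class LatentCondensedSet(nn.Module):
    def __init__(self, num_classes=10, ipc=10, code_dim=64, img_shape=(3, 32, 32)):
        super().__init__()
        # One learnable code per synthetic image (num_classes * ipc codes).
        self.codes = nn.Parameter(torch.randn(num_classes * ipc, code_dim))
        self.labels = torch.arange(num_classes).repeat_interleave(ipc)
        # Shared lightweight decoder mapping codes to images.
        c, h, w = img_shape
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, c * h * w), nn.Unflatten(1, img_shape),
        )

    def forward(self):
        return self.decoder(self.codes), self.labels   # synthetic images + labels


# Both the codes and the decoder are optimized against a condensation loss:
# syn_set = LatentCondensedSet()
# opt = torch.optim.Adam(syn_set.parameters(), lr=1e-3)
# images, labels = syn_set()   # plug into any matching objective
```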
- Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only a single step without training the network weights.
Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs.
In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
arXiv Detail & Related papers (2022-06-15T18:20:01Z)
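
The one-step scheme in the last entry can be sketched as follows: with a freshly initialized (and never trained) model, the gradient of the loss on synthetic data is pushed toward the gradient on real data, and only the synthetic data is updated. The paper targets graphs and GNNs; the sketch below uses an arbitrary classifier to stay self-contained.

```python
# Hedged sketch of one-step gradient matching: compare the loss gradient on
# real data with the loss gradient on synthetic data under a fixed, randomly
# initialized model, and update only the synthetic data. `model` is any
# differentiable classifier; the paper's actual setup uses GNNs on graphs.
import torch
import torch.nn.functional as F


def one_step_gradient_match(model, real_x, real_y, syn_x, syn_y):
    params = [p for p in model.parameters() if p.requires_grad]

    real_grads = torch.autograd.grad(
        F.cross_entropy(model(real_x), real_y), params)
    syn_grads = torch.autograd.grad(
        F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)

    # Sum of per-parameter gradient distances; backprop flows into syn_x only.
    return sum(F.mse_loss(gs, gr.detach()) for gs, gr in zip(syn_grads, real_grads))


# Usage (syn_x is a leaf tensor with requires_grad=True):
# loss = one_step_gradient_match(model, real_x, real_y, syn_x, syn_y)
# loss.backward(); optimizer.step()   # optimizer updates syn_x, not the model
```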