SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization
- URL: http://arxiv.org/abs/2503.13935v1
- Date: Tue, 18 Mar 2025 06:04:44 GMT
- Title: SCORE: Soft Label Compression-Centric Dataset Condensation via Coding Rate Optimization
- Authors: Bowen Yuan, Yuxia Fu, Zijian Wang, Yadan Luo, Zi Huang
- Abstract summary: This paper proposes a Soft label compression-centric dataset condensation framework. It balances informativeness, discriminativeness, and compressibility of the condensed data. Experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases.
- Score: 29.93981107658258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset Condensation (DC) aims to obtain a condensed dataset such that models trained on it achieve performance comparable to models trained on the full dataset. Recent DC approaches increasingly focus on encoding knowledge into realistic images with soft labeling, owing to their scalability to ImageNet-scale datasets and strong cross-domain generalization. However, this strong performance comes at a substantial storage cost, which can significantly exceed that of the original dataset. We argue that the three key properties for alleviating this performance-storage dilemma are the informativeness, discriminativeness, and compressibility of the condensed data. Towards this end, this paper proposes a Soft label compression-centric dataset condensation framework using COding RatE (SCORE). SCORE formulates dataset condensation as a min-max optimization problem that balances the three key properties from an information-theoretic perspective. In particular, we theoretically demonstrate that our coding rate-inspired objective function is submodular, and that its optimization naturally enforces low-rank structure in the soft label set corresponding to each condensed datum. Extensive experiments on large-scale datasets, including ImageNet-1K and Tiny-ImageNet, demonstrate that SCORE outperforms existing methods in most cases. Even with 30× compression of soft labels, performance decreases by only 5.5% and 2.7% on ImageNet-1K with IPC 10 and 50, respectively. Code will be released upon paper acceptance.
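The "coding rate-inspired objective" mentioned in the abstract builds on the log-det coding rate used in the maximal coding rate reduction (MCR²) literature. As a rough illustration of that quantity only (not SCORE's actual min-max objective; the function and variable names below are ours), here is a minimal PyTorch sketch:

```python
import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    """Coding rate R(Z) = 1/2 * logdet(I + d/(n*eps^2) * Z^T Z).

    Z: (n, d) matrix of n feature (or soft-label) vectors of dimension d.
    A larger value means the rows of Z span a higher-dimensional,
    less compressible subspace.
    """
    n, d = Z.shape
    I = torch.eye(d, device=Z.device, dtype=Z.dtype)
    scale = d / (n * eps ** 2)
    return 0.5 * torch.logdet(I + scale * Z.T @ Z)

# Toy comparison: a near-rank-1 soft-label matrix has a low coding rate
# (highly compressible), while an unstructured one has a high coding rate.
low_rank = torch.randn(100, 1) @ torch.randn(1, 10)   # ~rank-1 soft labels
full_rank = torch.randn(100, 10)                      # unstructured soft labels
print(coding_rate(low_rank), coding_rate(full_rank))
```

The toy comparison shows why a coding-rate term favors compressibility: a near-rank-1 label set yields a much smaller value than an unstructured one, consistent with the low-rank soft-label structure the abstract describes.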
Related papers
- Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
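The Prune and Merge summary above names the technique but not its mechanics. Below is a hypothetical sketch of a generic prune-and-merge step for ViT tokens, in which low-scoring tokens are not simply dropped but folded into one merged token; the scoring, weighting, and shapes are illustrative assumptions, not the paper's actual algorithm.

```python
import torch

def prune_and_merge(tokens: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Generic token compression: keep the top-`keep` tokens by score and
    merge the remaining (pruned) tokens into a single extra token.

    tokens: (B, N, D) token embeddings; scores: (B, N) importance scores.
    Returns a tensor of shape (B, keep + 1, D).
    """
    idx = scores.argsort(dim=1, descending=True)
    keep_idx, drop_idx = idx[:, :keep], idx[:, keep:]
    gather = lambda i: tokens.gather(1, i.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    kept, dropped = gather(keep_idx), gather(drop_idx)
    # Merge pruned tokens with a score-weighted average so their content is retained.
    w = torch.softmax(scores.gather(1, drop_idx), dim=1).unsqueeze(-1)
    merged = (w * dropped).sum(dim=1, keepdim=True)
    return torch.cat([kept, merged], dim=1)

x = torch.randn(2, 197, 768)                 # e.g. ViT-B/16 token sequence
s = torch.rand(2, 197)                       # assumed importance scores (e.g. from attention)
print(prune_and_merge(x, s, keep=98).shape)  # torch.Size([2, 99, 768])
```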
- Curriculum Coarse-to-Fine Selection for High-IPC Dataset Distillation [67.34754791044242]
We introduce a novel curriculum coarse-to-fine selection (CCFS) method for efficient high-IPC dataset distillation.
CCFS employs a coarse-to-fine strategy to select appropriate real data based on the current synthetic dataset in each curriculum.
arXiv Detail & Related papers (2025-03-24T16:47:40Z)
- Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
arXiv Detail & Related papers (2025-02-10T13:11:40Z)
- Decomposed Distribution Matching in Dataset Condensation [16.40653529334528]
Recent research formulates DC as a distribution matching problem which circumvents the costly bi-level optimization.
We present a simple yet effective method to match the style information between original and condensed data.
We demonstrate the efficacy of our method through experiments on diverse datasets of varying size and resolution.
arXiv Detail & Related papers (2024-12-06T03:20:36Z)
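For context on the distribution-matching formulation above: such methods usually align feature statistics of real and condensed batches under a (possibly randomly initialized) encoder, avoiding bi-level optimization, and "style" is commonly captured by per-channel feature means and standard deviations. The sketch below is a hedged, generic version of such a loss, not the paper's exact decomposition:

```python
import torch
import torch.nn.functional as F

def style_stats(feat: torch.Tensor):
    """Per-channel mean and std over spatial dims, as in style transfer.
    feat: (B, C, H, W) feature maps from some encoder."""
    mu = feat.mean(dim=(2, 3))
    sigma = feat.var(dim=(2, 3), unbiased=False).add(1e-6).sqrt()
    return mu, sigma

def distribution_matching_loss(f_real: torch.Tensor, f_syn: torch.Tensor,
                               style_weight: float = 1.0) -> torch.Tensor:
    """Match the mean embedding (classic distribution matching) plus the
    per-channel style statistics of real vs. condensed features."""
    content = F.mse_loss(f_syn.mean(dim=0), f_real.mean(dim=0))
    mu_r, sd_r = style_stats(f_real)
    mu_s, sd_s = style_stats(f_syn)
    style = F.mse_loss(mu_s.mean(0), mu_r.mean(0)) + F.mse_loss(sd_s.mean(0), sd_r.mean(0))
    return content + style_weight * style

# Toy usage with random tensors standing in for encoder outputs.
f_real = torch.randn(64, 128, 8, 8)
f_syn = torch.randn(10, 128, 8, 8, requires_grad=True)
loss = distribution_matching_loss(f_real, f_syn)
loss.backward()  # in practice f_syn comes from an encoder applied to the condensed images
```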
- Heavy Labels Out! Dataset Distillation with Label Space Lightening [69.67681224137561]
HeLlO aims to learn effective image-to-label projectors, with which synthetic labels can be generated online directly from synthetic images.
We demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets.
arXiv Detail & Related papers (2024-08-15T15:08:58Z)
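HeLlO's storage saving, as summarized above, comes from generating soft labels online from a lightweight image-to-label projector instead of storing the full soft-label tensor. The following is a hedged sketch of that general idea (a small linear head over a frozen feature extractor); the paper's actual projector design and initialization are not reproduced here:

```python
import torch
import torch.nn as nn

class OnlineSoftLabeler(nn.Module):
    """Produce soft labels on demand: only the small projector is stored,
    not a (num_images x num_classes x num_augmentations) soft-label tensor."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 temperature: float = 4.0):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():   # frozen feature extractor
            p.requires_grad_(False)
        self.projector = nn.Linear(feat_dim, num_classes)  # the only stored component
        self.temperature = temperature

    @torch.no_grad()
    def forward(self, synthetic_images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(synthetic_images)
        return torch.softmax(self.projector(feats) / self.temperature, dim=-1)

# Toy usage: a dummy backbone standing in for, e.g., a frozen pretrained encoder.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
labeler = OnlineSoftLabeler(backbone, feat_dim=512, num_classes=1000)
soft_labels = labeler(torch.randn(8, 3, 32, 32))   # (8, 1000), generated online
```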
- Elucidating the Design Space of Dataset Condensation [23.545641118984115]
A concept within data-centric learning, dataset condensation efficiently transfers critical attributes from an original dataset to a synthetic version.
We propose a comprehensive design framework that includes specific, effective strategies like implementing soft category-aware matching.
In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC of 10, which corresponds to a compression ratio of 0.78%.
arXiv Detail & Related papers (2024-04-21T18:19:27Z)
- You Only Condense Once: Two Rules for Pruning Condensed Datasets [41.92794134275854]
You Only Condense Once (YOCO) produces smaller condensed datasets with two embarrassingly simple dataset pruning rules.
Experiments validate our findings on networks including ConvNet, ResNet and DenseNet.
arXiv Detail & Related papers (2023-10-21T14:05:58Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Leveraging Hierarchical Feature Sharing for Efficient Dataset Condensation [38.59750970617013]
We propose a novel data parameterization architecture, the Hierarchical Memory Network (HMN).
HMN stores condensed data in a three-tier structure, representing the dataset-level, class-level, and instance-level features.
We evaluate HMN on five public datasets and show that our proposed method outperforms all baselines.
arXiv Detail & Related papers (2023-10-11T14:02:11Z)
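The three-tier structure described above can be illustrated generically: a condensed image is decoded from the sum of a shared dataset-level code, a class-level code, and a per-instance code, so common information is stored only once. This is a hypothetical parameterization sketch, not HMN's actual architecture:

```python
import math
import torch
import torch.nn as nn

class HierarchicalMemory(nn.Module):
    """Three-tier data parameterization: a shared dataset-level code, one code
    per class, and one code per condensed instance are summed and decoded."""
    def __init__(self, num_classes: int, ipc: int, code_dim: int = 64,
                 image_shape=(3, 32, 32)):
        super().__init__()
        self.dataset_code = nn.Parameter(torch.randn(1, code_dim))            # stored once
        self.class_codes = nn.Parameter(torch.randn(num_classes, code_dim))   # one per class
        self.instance_codes = nn.Parameter(torch.randn(num_classes, ipc, code_dim))
        self.decoder = nn.Linear(code_dim, math.prod(image_shape))            # shared decoder
        self.image_shape = image_shape

    def forward(self, cls: int) -> torch.Tensor:
        """Decode all condensed images for class `cls`."""
        codes = self.dataset_code + self.class_codes[cls] + self.instance_codes[cls]
        return self.decoder(codes).view(-1, *self.image_shape)

memory = HierarchicalMemory(num_classes=10, ipc=10)
print(memory(cls=3).shape)   # torch.Size([10, 3, 32, 32]): condensed images for class 3
```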
- Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory [66.035487142452]
We show that trajectory-matching-based methods (MTT) can scale to large-scale datasets such as ImageNet-1K.
We propose a procedure to exactly compute the unrolled gradient with constant memory complexity, which allows us to scale MTT to ImageNet-1K seamlessly with 6x reduction in memory footprint.
The resulting algorithm sets a new SOTA on ImageNet-1K: we can scale up to 50 IPCs (Images Per Class) on a single GPU.
arXiv Detail & Related papers (2022-11-19T04:46:03Z)
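For reference on the trajectory-matching (MTT) setting above, the sketch below shows the basic trajectory-matching objective and a toy unrolled inner loop; differentiating through the unrolled steps is what normally makes memory grow with the number of steps, and the paper's constant-memory exact-gradient computation (its actual contribution) is not reproduced here:

```python
import torch

def trajectory_matching_loss(student: torch.Tensor, expert_start: torch.Tensor,
                             expert_target: torch.Tensor) -> torch.Tensor:
    """MTT-style objective: after a few steps of training on the synthetic data,
    the student's parameters should land near a later expert checkpoint,
    normalized by how far the expert itself moved between the two checkpoints."""
    return ((student - expert_target).pow(2).sum()
            / (expert_start - expert_target).pow(2).sum())

# Toy unroll with a linear model, differentiating through the inner SGD steps.
torch.manual_seed(0)
syn_x = torch.randn(10, 5, requires_grad=True)             # learnable synthetic inputs
syn_y = torch.randn(10, 1)                                 # fixed synthetic targets
w_start, w_target = torch.randn(5, 1), torch.randn(5, 1)   # two expert checkpoints

w = w_start.clone().requires_grad_(True)
for _ in range(3):                                         # unrolled inner SGD steps
    inner_loss = ((syn_x @ w - syn_y) ** 2).mean()
    (g,) = torch.autograd.grad(inner_loss, w, create_graph=True)
    w = w - 0.01 * g

loss = trajectory_matching_loss(w.flatten(), w_start.flatten(), w_target.flatten())
loss.backward()                                            # gradients reach syn_x through the unroll
```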
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
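The entry above parameterizes the condensed set as learnable latent codes passed through a shared generative decoder rather than as raw pixels. Below is a hedged, generic sketch of such a parameterization (the paper's specific factorization and sharing scheme are not reproduced):

```python
import math
import torch
import torch.nn as nn

class LatentCondensedSet(nn.Module):
    """Condensed data as learnable latent codes plus a shared decoder,
    instead of optimizing raw pixels directly."""
    def __init__(self, num_codes: int, code_dim: int = 128, image_shape=(3, 32, 32)):
        super().__init__()
        self.codes = nn.Parameter(torch.randn(num_codes, code_dim))   # learnable codes
        self.decoder = nn.Sequential(                                  # shared generator
            nn.Linear(code_dim, 256), nn.ReLU(),
            nn.Linear(256, math.prod(image_shape)), nn.Tanh(),
        )
        self.image_shape = image_shape

    def forward(self) -> torch.Tensor:
        return self.decoder(self.codes).view(-1, *self.image_shape)

condensed = LatentCondensedSet(num_codes=100)   # e.g. 10 classes x IPC 10
images = condensed()                            # (100, 3, 32, 32) decoded condensed images
# Both the codes and the decoder would be optimized with whatever condensation
# loss (e.g. gradient or distribution matching) the surrounding method prescribes.
```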