Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation
- URL: http://arxiv.org/abs/2511.17914v1
- Date: Sat, 22 Nov 2025 04:37:27 GMT
- Title: Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation
- Authors: Chenyang Jiang, Hang Zhao, Xinyu Zhang, Zhengcen Li, Qiben Shan, Shaocong Wu, Jingyong Su
- Abstract summary: We emphasize the critical role of soft labels in long-tailed dataset distillation. We derive an imbalance-aware generalization bound for models trained on the distilled dataset. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images. We propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases.
- Score: 39.47633542394261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation compresses large-scale datasets into compact, highly informative synthetic data, significantly reducing storage and training costs. However, existing research primarily focuses on balanced datasets and struggles to perform under real-world long-tailed distributions. In this work, we emphasize the critical role of soft labels in long-tailed dataset distillation and uncover the underlying mechanisms contributing to performance degradation. Specifically, we derive an imbalance-aware generalization bound for models trained on the distilled dataset. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images, through systematic perturbation of the data imbalance levels. To address this, we propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases. This lightweight module integrates seamlessly into existing distillation pipelines and consistently improves performance. On ImageNet-1k-LT with EDC and IPC=50, ADSA improves tail-class accuracy by up to 11.8% and raises overall accuracy to 41.4%. Extensive experiments demonstrate that ADSA provides a robust and generalizable solution under limited label budgets and across a range of distillation techniques. Code is available at: https://github.com/j-cyoung/ADSA_DD.git.
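ADSA's actual formulation is given in the paper and its repository; as a minimal illustrative sketch of the general idea of debiasing a teacher's soft labels under class imbalance, the logit-adjustment-style correction below subtracts the log class-prior before re-normalizing. All names and the correction itself are assumptions for illustration, not the authors' API:

```python
import torch
import torch.nn.functional as F

def debias_soft_labels(teacher_logits: torch.Tensor,
                       class_counts: torch.Tensor,
                       tau: float = 1.0) -> torch.Tensor:
    """Hypothetical soft-label debiasing via logit adjustment.

    Subtracting tau * log p(y) from the teacher logits keeps head
    classes, over-represented in a long-tailed training set, from
    dominating the soft labels assigned to synthetic images.
    (Generic technique; ADSA's actual calibration differs.)
    """
    priors = class_counts.float() / class_counts.sum()    # empirical p(y)
    adjusted = teacher_logits - tau * torch.log(priors)   # broadcast over batch
    return F.softmax(adjusted, dim=-1)

# Example: a teacher trained on a long-tailed source labels a batch.
logits = torch.randn(8, 1000)              # batch of 8, 1000 classes
counts = torch.randint(5, 1280, (1000,))   # long-tailed class counts
soft_labels = debias_soft_labels(logits, counts)
```

Note that this corrects only the model-side prior; the abstract stresses that the bias entangled with the distilled images must be calibrated jointly, which is what the alignment module addresses.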
Related papers
- Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling [105.8570596633629]
We rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods. We adopt the statistical alignment perspective to jointly model bias and restore fair supervision. Our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT.
arXiv Detail & Related papers (2025-11-24T07:57:01Z)
- DD-Ranking: Rethinking the Evaluation of Dataset Distillation [314.9621366437238]
We propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics, to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research (an illustrative baseline-relative metric is sketched after this entry).
arXiv Detail & Related papers (2025-05-19T16:19:50Z)
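DD-Ranking's official metrics are defined in the paper and its toolkit; purely as an illustration of baseline-relative evaluation, one simple way to normalize a distilled dataset's accuracy against a random-subset baseline trained with the same recipe is a recovery ratio (a hypothetical helper, not DD-Ranking's formula):

```python
def recovery_ratio(distilled_acc: float, random_acc: float, full_acc: float) -> float:
    """Fraction of the random-subset-to-full-dataset accuracy gap that
    the distilled set recovers: 0 means no better than a random subset
    of the same size, 1 means full-dataset performance. (Illustrative
    only; DD-Ranking's official metrics differ.)
    """
    return (distilled_acc - random_acc) / (full_acc - random_acc)

# Made-up baseline numbers for illustration; 41.4% is the ADSA result above.
print(recovery_ratio(distilled_acc=0.414, random_acc=0.25, full_acc=0.60))  # ~0.47
```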
- Dataset Distillation via Committee Voting [21.018818924580877]
We introduce Committee Voting for Dataset Distillation (CV-DD), a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets (a minimal sketch of the committee idea follows this entry).
arXiv Detail & Related papers (2025-01-13T18:59:48Z)
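CV-DD's exact voting and weighting scheme is in the paper; a minimal sketch of the committee idea, under the assumption that voting amounts to averaging the experts' temperature-softened predictions, might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def committee_soft_labels(experts, images, temperature: float = 2.0):
    """Average temperature-softened predictions from a committee of
    expert models to label (synthetic) images. A generic ensemble
    sketch; CV-DD's actual voting may weight experts differently.
    """
    probs = [F.softmax(expert(images) / temperature, dim=-1)
             for expert in experts]
    return torch.stack(probs).mean(dim=0)   # (batch, num_classes)

# Example: two linear "experts" labeling a batch of flattened inputs.
experts = [nn.Linear(32, 10), nn.Linear(32, 10)]
soft = committee_soft_labels(experts, torch.randn(4, 32))  # (4, 10)
```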
- Distilling Long-tailed Datasets [13.330572317331198]
We propose Distribution-agnostic Matching to avoid directly matching the biased expert trajectories. We also refine the distillation guidance with Expert Decoupling to improve tail-class performance. This work pioneers the field of long-tailed dataset distillation, marking the first effective effort to distill long-tailed datasets.
arXiv Detail & Related papers (2024-08-24T15:36:36Z)
- Mitigating Bias in Dataset Distillation [62.79454960378792]
We study the impact of bias inside the original dataset on the performance of dataset distillation.
We introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation (sketched after this entry).
arXiv Detail & Related papers (2024-06-06T18:52:28Z)
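How the weights enter the distillation objective is specified in the paper; as a hedged sketch of just the kernel-density step, assuming weights are taken inversely proportional to a KDE fit on sample features (function and parameter names are illustrative):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sample_weights(features: np.ndarray, bandwidth: float = 1.0) -> np.ndarray:
    """Weight samples inversely to their estimated density so that
    over-represented (biased) regions of the dataset contribute less.
    Illustrative only; the paper defines the actual reweighting.
    """
    kde = KernelDensity(bandwidth=bandwidth).fit(features)
    log_w = -kde.score_samples(features)   # log(1 / p(x)) per sample
    log_w -= log_w.max()                   # stabilize before exponentiating
    weights = np.exp(log_w)
    return weights / weights.sum()         # normalized sample weights

weights = kde_sample_weights(np.random.randn(500, 16))
```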
- Exploring the potential of prototype-based soft-labels data distillation for imbalanced data classification [0.0]
The main goal is to push the performance of prototype-based soft-labels distillation further in terms of classification accuracy.
Experimental studies trace not only the method's capability to distill the data, but also its potential to act as an augmentation method (a sketch of the prototype construction follows this entry).
arXiv Detail & Related papers (2024-03-25T19:15:19Z)
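As a rough sketch of prototype-based soft-label distillation, assuming one class-mean prototype per class and soft labels taken as a softmax over negative squared distances (the paper's construction may differ):

```python
import numpy as np

def class_mean_prototypes(X: np.ndarray, y: np.ndarray, n_classes: int) -> np.ndarray:
    """One prototype per class: the mean of that class's samples."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def prototype_soft_labels(prototypes: np.ndarray, X: np.ndarray,
                          temp: float = 1.0) -> np.ndarray:
    """Soft labels from a softmax over negative squared distances
    to the prototypes; smaller temp gives sharper labels."""
    d2 = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)  # (n, C)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Example on toy data with 3 classes in a 5-d feature space.
X = np.random.randn(60, 5)
y = np.repeat(np.arange(3), 20)
P = class_mean_prototypes(X, y, n_classes=3)
S = prototype_soft_labels(P, X)
```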
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study data efficiency and sample selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the most influential samples based on their causal effects on the distillation (a pruning sketch follows this entry).
arXiv Detail & Related papers (2023-05-28T06:53:41Z)
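The bi-level formulation and causal-effect scoring are the paper's contribution; as a minimal sketch of only the final pruning step, assuming a precomputed per-sample utility score (here filled with random values as a stand-in):

```python
import numpy as np

def prune_by_utility(utility: np.ndarray, keep_fraction: float = 0.5) -> np.ndarray:
    """Return indices of the highest-utility samples. `utility` is
    assumed precomputed, e.g. an influence or gradient-norm proxy;
    the paper scores samples by their causal effect on distillation,
    which this sketch does not reproduce."""
    k = max(1, int(keep_fraction * len(utility)))
    return np.argsort(utility)[-k:]   # indices of the top-k samples

keep_idx = prune_by_utility(np.random.rand(10_000), keep_fraction=0.2)
```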