Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling
- URL: http://arxiv.org/abs/2511.18858v1
- Date: Mon, 24 Nov 2025 07:57:01 GMT
- Title: Rethinking Long-tailed Dataset Distillation: A Uni-Level Framework with Unbiased Recovery and Relabeling
- Authors: Xiao Cui, Yulei Qin, Xinyue Li, Wengang Zhou, Hongsheng Li, Houqiang Li
- Abstract summary: We rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods. We adopt the statistical alignment perspective to jointly mitigate model bias and restore fair supervision. Our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT.
- Score: 105.8570596633629
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dataset distillation creates a small distilled set that enables efficient training by capturing key information from the full dataset. While existing dataset distillation methods perform well on balanced datasets, they struggle under long-tailed distributions, where imbalanced class frequencies induce biased model representations and corrupt statistical estimates such as Batch Normalization (BN) statistics. In this paper, we rethink long-tailed dataset distillation by revisiting the limitations of trajectory-based methods, and instead adopt the statistical alignment perspective to jointly mitigate model bias and restore fair supervision. To this end, we introduce three dedicated components that enable unbiased recovery of distilled images and soft relabeling: (1) enhancing expert models (an observer model for recovery and a teacher model for relabeling) to enable reliable statistics estimation and soft-label generation; (2) recalibrating BN statistics via a full forward pass with dynamically adjusted momentum to reduce representation skew; (3) initializing synthetic images by incrementally selecting high-confidence and diverse augmentations via a multi-round mechanism that promotes coverage and diversity. Extensive experiments on four long-tailed benchmarks show consistent improvements over state-of-the-art methods across varying degrees of class imbalance. Notably, our approach improves top-1 accuracy by 15.6% on CIFAR-100-LT and 11.8% on Tiny-ImageNet-LT under IPC=10 and IF=10.
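To make component (2) concrete, here is a minimal sketch of BN-statistics recalibration with dynamically adjusted momentum, assuming a PyTorch-style observer model and data loader. The function name `recalibrate_bn` and the 1/(t+1) momentum schedule are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def recalibrate_bn(observer: nn.Module, loader, device: str = "cuda") -> None:
    """Re-estimate BatchNorm running statistics with one full forward pass.

    The BN momentum is shrunk per batch so that every batch contributes
    equally to the recovered mean/variance, rather than letting the last
    (possibly head-class-dominated) batches dominate the estimate.
    """
    observer.to(device)
    observer.train()  # BN layers only update running stats in train mode
    bn_layers = [m for m in observer.modules()
                 if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d))]

    # Discard the biased statistics accumulated during imbalanced training.
    for bn in bn_layers:
        bn.reset_running_stats()

    for step, (images, _) in enumerate(loader):
        # Momentum 1/(t+1) turns the EMA update into a plain cumulative mean.
        momentum = 1.0 / (step + 1)
        for bn in bn_layers:
            bn.momentum = momentum
        observer(images.to(device))
```

With this schedule, the running statistics after the full pass equal the average of the per-batch statistics, so no portion of the long-tailed data is over- or under-weighted by the order in which batches arrive; whether the paper uses exactly this schedule is an assumption here.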
Related papers
- Optimizing Distributional Geometry Alignment with Optimal Transport for Generative Dataset Distillation [109.13471554184554]
We reformulate dataset distillation as an Optimal Transport (OT) distance minimization problem. OT offers a geometrically faithful framework for distribution matching. Our method consistently outperforms state-of-the-art approaches in an efficient manner. (A generic OT sketch appears after this list.)
arXiv Detail & Related papers (2025-11-29T04:04:05Z) - Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation [39.47633542394261]
We emphasize the critical role of soft labels in long-tailed dataset distillation. We derive an imbalance-aware generalization bound for models trained on the distilled dataset. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images. We propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases.
arXiv Detail & Related papers (2025-11-22T04:37:27Z) - Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models [22.094181812322574]
Training deep neural networks has become increasingly demanding, requiring large datasets and significant computational resources. We introduce a new data distillation approach for image SR that does not need class labels or pre-trained SR models. Experimental results show that our method achieves state-of-the-art performance while using significantly less training data and requiring less computational time.
arXiv Detail & Related papers (2025-09-18T09:25:51Z) - Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting [3.1959623025848405]
Deep neural networks often rely on spurious correlations in training data, leading to biased or unfair predictions in safety-critical domains such as medicine and autonomous driving. Recent advances in machine unlearning provide a promising alternative for post-hoc model correction.
arXiv Detail & Related papers (2025-09-09T07:25:51Z) - MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models [50.2406741245418]
We propose a mode-guided diffusion model leveraging a pre-trained diffusion model. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs.
arXiv Detail & Related papers (2025-05-25T03:40:23Z) - Dataset Distillation via Committee Voting [21.018818924580877]
We introduce Committee Voting for Dataset Distillation (CV-DD). CV-DD is a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.
arXiv Detail & Related papers (2025-01-13T18:59:48Z) - Generative Dataset Distillation Based on Self-knowledge Distillation [49.20086587208214]
We present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits. Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data. Our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
arXiv Detail & Related papers (2025-01-08T00:43:31Z) - Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
Multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results.
It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet.
We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z) - Dynamic Sub-graph Distillation for Robust Semi-supervised Continual Learning [47.64252639582435]
We focus on semi-supervised continual learning (SSCL), where the model progressively learns from partially labeled data with unknown categories. We propose a novel approach called Dynamic Sub-Graph Distillation (DSGD) for semi-supervised continual learning.
arXiv Detail & Related papers (2023-12-27T04:40:12Z) - Data-iterative Optimization Score Model for Stable Ultra-Sparse-View CT Reconstruction [2.2336243882030025]
We propose an iterative optimization data scoring model (DOSM) for sparse-view CT reconstruction.
DOSM integrates measurement information into its data consistency element, effectively balancing measurement data and generative model constraints.
We leverage conventional techniques to optimize DOSM updates.
arXiv Detail & Related papers (2023-08-28T09:23:18Z) - Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
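As referenced for the first related paper above, OT-based distribution matching measures how far the synthetic-set feature distribution sits from the real-set distribution. Below is a generic entropy-regularized OT (Sinkhorn) sketch of that idea; it is a standard textbook routine under uniform marginals, not that paper's implementation, and all names and parameter values are placeholders.

```python
import torch

def sinkhorn_ot_distance(x: torch.Tensor, y: torch.Tensor,
                         eps: float = 0.05, iters: int = 200) -> torch.Tensor:
    """Approximate OT distance between two feature sets (minimal sketch).

    x: (n, d) synthetic-set features; y: (m, d) real-set features.
    Uniform marginals are assumed; eps is the entropic regularization strength.
    """
    n, m = x.size(0), y.size(0)
    a = torch.full((n,), 1.0 / n, device=x.device)   # synthetic marginal
    b = torch.full((m,), 1.0 / m, device=x.device)   # real marginal
    cost = torch.cdist(x, y, p=2) ** 2               # squared-Euclidean cost matrix
    K = torch.exp(-cost / eps)                       # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(iters):                           # Sinkhorn fixed-point updates
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = torch.diag(u) @ K @ torch.diag(v)         # approximate transport plan
    return (plan * cost).sum()                       # entropic OT objective
```

Minimizing such a distance with respect to the synthetic images (or their features) is one way to realize the "geometrically faithful distribution matching" the summary describes.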