Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
- URL: http://arxiv.org/abs/2510.27253v1
- Date: Fri, 31 Oct 2025 07:41:41 GMT
- Title: Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation
- Authors: Qiyan Deng, Changqian Zheng, Lianpeng Qiao, Yuping Wang, Chengliang Chai, Lei Cao
- Abstract summary: We present Influence-Weighted Distillation (IWD), a principled framework that leverages influence functions to account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
- Score: 10.625826589163252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation (IWD), a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
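The abstract describes IWD only at a high level: estimate each real instance's influence on the distillation objective, then turn those estimates into per-instance weights. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the helper names and the first-order influence estimate (per-sample gradient dotted with the gradient of a reference objective) are assumptions made for illustration.

```python
# Minimal, hypothetical sketch of influence-weighted distillation.
# The helper names and the first-order influence estimate are assumptions
# for illustration, not the paper's actual API or algorithm.
import torch
import torch.nn as nn
import torch.nn.functional as F


def flat_grad(loss, params):
    """Flatten the gradients of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])


def estimate_influence(model, x_real, y_real, x_ref, y_ref):
    """Score each real instance by how well its gradient aligns with the
    gradient of a reference objective (e.g., a held-out or distillation loss).
    Positive scores suggest helpful instances, negative scores harmful ones."""
    params = [p for p in model.parameters() if p.requires_grad]
    ref_grad = flat_grad(F.cross_entropy(model(x_ref), y_ref), params)
    scores = []
    for i in range(x_real.size(0)):
        loss_i = F.cross_entropy(model(x_real[i:i + 1]), y_real[i:i + 1])
        scores.append(torch.dot(flat_grad(loss_i, params), ref_grad))
    return torch.stack(scores)


def influence_to_weights(scores, temperature=1.0):
    """Map influence scores to nonnegative weights averaging 1 over the batch,
    so low- or negative-influence instances are downweighted."""
    weights = torch.softmax(scores / temperature, dim=0)
    return weights * scores.numel()


def weighted_real_loss(model, x_real, y_real, weights):
    """Per-instance loss on real data, reweighted by the influence weights."""
    per_sample = F.cross_entropy(model(x_real), y_real, reduction="none")
    return (weights.detach() * per_sample).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    x_real, y_real = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    x_ref, y_ref = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
    scores = estimate_influence(model, x_real, y_real, x_ref, y_ref)
    weights = influence_to_weights(scores)
    print(weights, weighted_real_loss(model, x_real, y_real, weights).item())
```

In a gradient-matching or trajectory-matching pipeline, such a weighted real-data loss would simply replace the unweighted one when computing the targets that the synthetic data are matched against, which is consistent with the modular integration the abstract describes.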
Related papers
- Rectifying Soft-Label Entangled Bias in Long-Tailed Dataset Distillation [39.47633542394261]
We emphasize the critical role of soft labels in long-tailed dataset distillation. We derive an imbalance-aware generalization bound for models trained on the distilled dataset. We then identify two primary sources of soft-label bias, which originate from the distillation model and the distilled images. We propose ADSA, an Adaptive Soft-label Alignment module that calibrates the entangled biases.
arXiv Detail & Related papers (2025-11-22T04:37:27Z)
- DD-Ranking: Rethinking the Evaluation of Dataset Distillation [314.9621366437238]
We propose DD-Ranking, a unified evaluation framework, along with new general evaluation metrics to uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research advancements.
arXiv Detail & Related papers (2025-05-19T16:19:50Z)
- Dataset Distillation via Committee Voting [21.018818924580877]
We introduce Committee Voting for Dataset Distillation (CV-DD), a novel approach that leverages the collective wisdom of multiple models or experts to create high-quality distilled datasets.
arXiv Detail & Related papers (2025-01-13T18:59:48Z)
- Mitigating Bias in Dataset Distillation [62.79454960378792]
We study the impact of bias inside the original dataset on the performance of dataset distillation.
We introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation (see the sketch after this list).
arXiv Detail & Related papers (2024-06-06T18:52:28Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
The development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the most influential samples based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z)
- Evaluating the effect of data augmentation and BALD heuristics on distillation of Semantic-KITTI dataset [63.20765930558542]
Active Learning has remained relatively unexplored for LiDAR perception tasks in autonomous driving datasets.
We evaluate Bayesian active learning methods applied to the task of dataset distillation or core subset selection.
We also study the effect of applying data augmentation within Bayesian AL-based dataset distillation.
arXiv Detail & Related papers (2023-02-21T13:56:47Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power.
Deep learning technology has developed at an unprecedented pace in the last decade.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
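The "Mitigating Bias in Dataset Distillation" entry above mentions a sample reweighting scheme based on kernel density estimation. The sketch below is one plausible, hypothetical reading of that idea, assuming a Gaussian kernel over model features and weights inversely proportional to the estimated density; the function names and the inverse-density rule are illustrative assumptions, not that paper's exact method.

```python
# Hypothetical sketch of kernel-density-based sample reweighting for bias
# mitigation; the Gaussian-KDE form and the inverse-density weighting rule
# are assumptions for illustration only.
import torch


def gaussian_kde_density(features, bandwidth=1.0):
    """Estimate each sample's density with a Gaussian-kernel KDE over the
    batch itself (self-terms included for simplicity)."""
    sq_dists = torch.cdist(features, features).pow(2)
    kernel = torch.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return kernel.mean(dim=1)


def inverse_density_weights(features, bandwidth=1.0, eps=1e-8):
    """Upweight rare (low-density) samples and downweight over-represented
    ones, normalizing so the weights average to 1 over the batch."""
    density = gaussian_kde_density(features, bandwidth)
    weights = 1.0 / (density + eps)
    return weights / weights.mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(16, 64)              # e.g., penultimate-layer features
    w = inverse_density_weights(feats)
    print(w)                                  # multiply per-sample losses by w
```

Under these assumptions, the resulting weights would multiply per-sample losses (or matching targets) during distillation, so that over-represented regions of the data do not dominate the synthetic set.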