Coresets from Trajectories: Selecting Data via Correlation of Loss Differences
- URL: http://arxiv.org/abs/2508.20230v1
- Date: Wed, 27 Aug 2025 19:18:39 GMT
- Title: Coresets from Trajectories: Selecting Data via Correlation of Loss Differences
- Authors: Manish Nagaraj, Deepak Ravikumar, Kaushik Roy
- Abstract summary: Correlation of Loss Differences (CLD) is a scalable metric for coreset selection. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods.
- Score: 14.31847187460321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
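The abstract describes CLD concretely enough to sketch: score each training sample by correlating its loss differences across checkpoints with the validation set's mean loss differences, then keep the top-k. The Pearson-correlation form and the exact normalisation below are assumptions, not the authors' published procedure.

```python
import numpy as np

def cld_scores(train_losses, val_losses):
    """Sketch of Correlation of Loss Differences (CLD).

    train_losses: (n_samples, n_checkpoints) per-sample training losses
    val_losses:   (n_val, n_checkpoints) held-out validation losses
    Returns one score per training sample: the Pearson correlation between
    that sample's loss differences across checkpoints and the mean
    validation loss differences.
    """
    d_train = np.diff(train_losses, axis=1)             # (n, T-1)
    d_val = np.diff(val_losses, axis=1).mean(axis=0)    # (T-1,)
    d_train_c = d_train - d_train.mean(axis=1, keepdims=True)
    d_val_c = d_val - d_val.mean()
    num = d_train_c @ d_val_c
    den = np.linalg.norm(d_train_c, axis=1) * np.linalg.norm(d_val_c)
    return num / np.maximum(den, 1e-12)

def select_coreset(train_losses, val_losses, k):
    # Keep the k samples whose loss trajectories align best with validation.
    scores = cld_scores(train_losses, val_losses)
    return np.argsort(scores)[::-1][:k]
```

Note that this uses only per-sample losses logged at checkpoints, which is the source of the efficiency claim: no gradients or curvature are needed. Per-class validation alignment (mentioned for bias reduction) would apply the same scoring within each class separately.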
Related papers
- UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective [17.593940249922557]
We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods.
We scale UNSEEN to multi-step scenarios and propose an incremental selection technique through scoring models trained on varying coresets.
Our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K.
arXiv Detail & Related papers (2025-11-17T05:17:39Z)
- Grad-CL: Source Free Domain Adaptation with Gradient Guided Feature Disalignment [3.2371089062298317]
Grad-CL is a novel source-free domain adaptation framework.
It adapts segmentation models to new domains without requiring access to the original source data.
It outperforms state-of-the-art unsupervised and source-free domain adaptation methods.
arXiv Detail & Related papers (2025-09-12T10:51:46Z)
- Distributionally Robust Optimization with Adversarial Data Contamination [49.89480853499918]
We focus on optimizing Wasserstein-1 DRO objectives for generalized linear models with convex Lipschitz loss functions.
Our primary contribution lies in a novel modeling framework that integrates robustness against training data contamination with robustness against distributional shifts.
This work establishes the first rigorous guarantees, supported by efficient computation, for learning under the dual challenges of data contamination and distributional shifts.
arXiv Detail & Related papers (2025-07-14T18:34:10Z)
- Finding the Muses: Identifying Coresets through Loss Trajectories [7.293244528299574]
Loss Trajectory Correlation (LTC) is a novel metric for coreset selection that identifies critical training samples driving generalization.
LTC consistently achieves accuracy on par with or surpassing state-of-the-art coreset selection methods.
It also offers insights into training dynamics, such as identifying aligned and conflicting sample behaviors.
arXiv Detail & Related papers (2025-03-12T18:11:16Z)
- Adaptive Dataset Quantization [2.0105434963031463]
We introduce a versatile framework for dataset compression, namely Adaptive Dataset Quantization (ADQ).
We propose a novel adaptive sampling strategy that evaluates each generated bin's representativeness, diversity, and importance scores.
Our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by an average of 3% on various datasets.
arXiv Detail & Related papers (2024-12-22T07:08:29Z)
- Low Saturation Confidence Distribution-based Test-Time Adaptation for Cross-Domain Remote Sensing Image Classification [4.7514513970228425]
Unsupervised Domain Adaptation (UDA) has emerged as a powerful technique for addressing the distribution shift across various Remote Sensing (RS) applications.
Most UDA approaches require access to source data, which may be infeasible due to data privacy or transmission constraints.
Low Saturation Confidence Distribution Test-Time Adaptation (D-TTA) marks the first attempt to explore Test-Time Adaptation for cross-domain RS image classification.
arXiv Detail & Related papers (2024-08-29T05:04:25Z)
- Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning [52.06176253457522]
We propose a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning.
CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A.
arXiv Detail & Related papers (2023-08-18T13:13:09Z)
- SEMI-CenterNet: A Machine Learning Facilitated Approach for Semiconductor Defect Inspection [0.10555513406636088]
We have proposed SEMI-CenterNet (SEMI-CN), a customized CN architecture trained on SEM images of semiconductor wafer defects.
SEMI-CN gets trained to output the center, class, size, and offset of a defect instance.
We train SEMI-CN on two datasets and benchmark two ResNet backbones for the framework.
arXiv Detail & Related papers (2023-08-14T14:39:06Z)
- Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning [122.62311703151215]
Divide and Contrast (DaC) aims to connect the good ends of both worlds while bypassing their limitations.
DaC divides the target data into source-like and target-specific samples, where either group of samples is treated with tailored goals.
We further align the source-like domain with the target-specific samples using a memory bank-based Maximum Mean Discrepancy (MMD) loss to reduce the distribution mismatch.
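The MMD loss mentioned above has a standard closed form: the kernel mean discrepancy between two feature batches. The sketch below is the generic biased estimator with an RBF kernel; DaC's memory-bank construction and its actual kernel choice are not given in the summary, so treat both as assumptions.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(x, y, gamma=1.0):
    """Squared Maximum Mean Discrepancy between feature batches x and y.

    Biased V-statistic estimator: MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    In DaC, x would hold source-like features and y target-specific features
    (drawn from a memory bank); minimizing this pulls the two groups together.
    """
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

By construction the estimator is zero when the two batches coincide and grows as the feature distributions drift apart, which is what makes it usable as a distribution-mismatch penalty.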
arXiv Detail & Related papers (2022-11-12T09:21:49Z)
- Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals.
We analyze the challenges these methods meet with the empirical experiment results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- Test-time Batch Statistics Calibration for Covariate Shift [66.7044675981449]
We propose to adapt the deep models to the novel environment during inference.
We present a general formulation, $\alpha$-BN, to calibrate the batch statistics.
We also present a novel loss function to form a unified test time adaptation framework Core.
arXiv Detail & Related papers (2021-10-06T08:45:03Z)
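The batch-statistics calibration described in the last entry can be sketched as a convex mix of the source model's running BN statistics and statistics estimated from the current test batch. The mixing rule below and the name `alpha_bn` are illustrative assumptions; the paper's exact formulation and its Core loss are not given in the summary.

```python
import numpy as np

def alpha_bn(x, running_mean, running_var, alpha=0.1, eps=1e-5):
    """Test-time batch-statistic calibration in the spirit of alpha-BN.

    x:            (batch, features) activations at a BatchNorm layer.
    running_*:    statistics accumulated on the source domain.
    alpha:        weight on the test batch's own statistics
                  (alpha=0 reproduces ordinary inference-time BN;
                  alpha=1 fully renormalizes to the test batch).
    """
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    mean = alpha * batch_mean + (1.0 - alpha) * running_mean
    var = alpha * batch_var + (1.0 - alpha) * running_var
    return (x - mean) / np.sqrt(var + eps)
```

The design point is that covariate shift shows up directly in BN statistics, so interpolating toward the test batch adapts the model during inference without any backpropagation.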
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.