Related papers: Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need

Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need

URL: http://arxiv.org/abs/2507.04269v1
Date: Sun, 06 Jul 2025 07:02:04 GMT
Title: Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need
Authors: Mohammadreza Sharifi, Ahad Harati,
Abstract summary: In this paper, we present the Guided Spectrally Tuned Data Selection ( GSTDS) algorithm.<n> GSTDS dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model.<n>It achieves notable reductions in computational requirements, up to four times, without compromising performance.
Score: 0.30693357740321775
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model. Based on a pre-scheduled filtering ratio, GSTDS effectively reduces the number of data points processed per batch. The proposed method ensures an efficient selection of the most informative data points for training while avoiding redundant or less beneficial computations. Preserving data points in each batch is performed based on spectral analysis. A Fiedler vector-based scoring mechanism removes the filtered portion of the batch, lightening the resource requirements of the learning. The proposed data selection approach not only streamlines the training process but also promotes improved generalization and accuracy. Extensive experiments on standard image classification benchmarks, including CIFAR-10, Oxford-IIIT Pet, and Oxford-Flowers, demonstrate that GSTDS outperforms standard training scenarios and JEST, a recent state-of-the-art data curation method, on several key factors. It is shown that GSTDS achieves notable reductions in computational requirements, up to four times, without compromising performance. GSTDS exhibits a considerable growth in terms of accuracy under the limited computational resource usage, in contrast to other methodologies. These promising results underscore the potential of spectral-based data selection as a scalable solution for resource-efficient deep learning and motivate further exploration into adaptive data curation strategies. You can find the code at https://github.com/rezasharifi82/GSTDS.

Related papers

Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information [2.133855532092057]
We propose an effective data reduction strategy based on Pointwise V-information(PVI)<n>Experiments demonstrate that removing 10%-30% of the data preserves the classifier performance with only a 0.0001% to 0.76% loss in accuracy.<n>We have transferred the PVI framework, which previously applied only to English datasets, to diverse Chinese NLP tasks and base models.
arXiv Detail & Related papers (2025-06-19T06:59:19Z)
Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN) [0.0]
This paper introduces SALN, a method designed to prioritize and select samples within each batch rather than from the entire dataset.<n>The proposed method applies a spectral analysis-based to identify the most informative data points within each batch, improving both training speed and accuracy.<n>It demonstrates up to an 8x reduction in training time and up to a 5% increase in accuracy over standard training methods.
arXiv Detail & Related papers (2024-12-22T15:38:36Z)
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search [30.88999109835329]
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation.<n>We observe that only a subset of the data in constructed datasets plays a decisive role.<n>We introduce a new Filtering-WoRA paradigm, which contains a filtering algorithm to identify this crucial data subset and WoRA learning strategy for light fine-tuning.
arXiv Detail & Related papers (2024-04-16T05:29:14Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks. We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process. Our method can reduce total training time by up to 22% impacting accuracy only by 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z)
DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, and re-training. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs. We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data. Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution. We extend the classic importance resampling approach used in low-dimensions for LM data selection. We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time. We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
Weakly Supervised Change Detection Using Guided Anisotropic Difusion [97.43170678509478]
We propose original ideas that help us to leverage such datasets in the context of change detection. First, we propose the guided anisotropic diffusion (GAD) algorithm, which improves semantic segmentation results. We then show its potential in two weakly-supervised learning strategies tailored for change detection.
arXiv Detail & Related papers (2021-12-31T10:03:47Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.