Related papers: Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)

Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)

URL: http://arxiv.org/abs/2412.17069v1
Date: Sun, 22 Dec 2024 15:38:36 GMT
Title: Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)
Authors: Mohammadreza Sharifi,
Abstract summary: This paper introduces SALN, a method designed to prioritize and select samples within each batch rather than from the entire dataset.<n>The proposed method applies a spectral analysis-based to identify the most informative data points within each batch, improving both training speed and accuracy.<n>It demonstrates up to an 8x reduction in training time and up to a 5% increase in accuracy over standard training methods.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: In modern deep learning models, long training times and large datasets present significant challenges to both efficiency and scalability. Effective data curation and sample selection are crucial for optimizing the training process of deep neural networks. This paper introduces SALN, a method designed to prioritize and select samples within each batch rather than from the entire dataset. By utilizing jointly selected batches, SALN enhances training efficiency compared to independent batch selection. The proposed method applies a spectral analysis-based heuristic to identify the most informative data points within each batch, improving both training speed and accuracy. The SALN algorithm significantly reduces training time and enhances accuracy when compared to traditional batch prioritization or standard training procedures. It demonstrates up to an 8x reduction in training time and up to a 5\% increase in accuracy over standard training methods. Moreover, SALN achieves better performance and shorter training times compared to Google's JEST method developed by DeepMind.

Related papers

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning [49.04912820721943]
Supervised fine-tuning (SFT) is computationally expensive and sometimes suffers from overfitting or bias amplification.<n>This work studies the online batch selection family that dynamically scores and filters samples during the training process.<n>We develop textbfUDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT.
arXiv Detail & Related papers (2025-10-19T15:32:01Z)
Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection.<n>In offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty.<n>During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need [0.30693357740321775]
In this paper, we present the Guided Spectrally Tuned Data Selection ( GSTDS) algorithm.<n> GSTDS dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model.<n>It achieves notable reductions in computational requirements, up to four times, without compromising performance.
arXiv Detail & Related papers (2025-07-06T07:02:04Z)
LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning [22.242445543184264]
We propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop.<n>Experiments show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10x.
arXiv Detail & Related papers (2025-05-12T10:57:51Z)
Pruning-based Data Selection and Network Fusion for Efficient Deep Learning [13.900633576526863]
PruneFuse is a novel method that combines pruning and network fusion to enhance data selection and accelerate training. In PruneFuse, the original dense network is pruned to generate a smaller surrogate model that efficiently selects the most informative samples from the dataset.
arXiv Detail & Related papers (2025-01-02T07:35:53Z)
AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.<n>We introduce *AutoScale*, a novel, practical approach for optimizing data compositions at potentially large training data scales.<n>Our evaluation on GPT-2 Large and BERT pre-training demonstrates *AutoScale*'s effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training language models (LLMs) We find that Ask-LLM and Density sampling are the best methods in their respective categories. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together. We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks. We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process. Our method can reduce total training time by up to 22% impacting accuracy only by 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z)
D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training. We also show that repeating data intelligently consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
Benchmarking Neural Network Training Algorithms [52.890134877995195]
Training algorithms are an essential part of every deep learning pipeline.<n>As a community, we are unable to reliably identify training algorithm improvements.<n>We introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware.
arXiv Detail & Related papers (2023-06-12T15:21:02Z)
Repeated Random Sampling for Minimizing the Time-to-Accuracy of Learning [28.042568086423298]
Repeated Sampling of Random Subsets (RS2) is a powerful yet overlooked random sampling strategy. We test RS2 against thirty state-of-the-art data pruning and data distillation methods across four datasets including ImageNet. Our results demonstrate that RS2 significantly reduces time-to-accuracy compared to existing techniques.
arXiv Detail & Related papers (2023-05-28T20:38:13Z)
Efficient human-in-loop deep learning model training with iterative refinement and statistical result validation [0.0]
We demonstrate a method for creating segmentations, a necessary part of a data cleaning for ultrasound imaging machine learning pipelines. We propose a four-step method to leverage automatically generated training data and fast human visual checks to improve model accuracy while keeping the time/effort and cost low. The method is demonstrated on a cardiac ultrasound segmentation task, removing background data, including static PHI.
arXiv Detail & Related papers (2023-04-03T13:56:01Z)
Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems. We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
Subset Sampling For Progressive Neural Network Learning [106.12874293597754]
Progressive Neural Network Learning is a class of algorithms that incrementally construct the network's topology and optimize its parameters based on the training data. We propose to speed up this process by exploiting subsets of training data at each incremental training step. Experimental results in object, scene and face recognition problems demonstrate that the proposed approach speeds up the optimization procedure considerably.
arXiv Detail & Related papers (2020-02-17T18:57:33Z)
Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications. In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training. Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.