Exploring Variance Reduction in Importance Sampling for Efficient DNN Training
- URL: http://arxiv.org/abs/2501.13296v1
- Date: Thu, 23 Jan 2025 00:43:34 GMT
- Title: Exploring Variance Reduction in Importance Sampling for Efficient DNN Training
- Authors: Takuro Kutsuna
- Abstract summary: This paper proposes a method for estimating variance reduction during deep neural network (DNN) training using only minibatches sampled under importance sampling.
The paper also introduces an absolute metric to quantify the efficiency of importance sampling, along with an algorithm for real-time estimation of importance scores based on moving gradient statistics.
- Score: 1.7767466724342067
- Abstract: Importance sampling is widely used to improve the efficiency of deep neural network (DNN) training by reducing the variance of gradient estimators. However, efficiently assessing the variance reduction relative to uniform sampling remains challenging because of the computational overhead involved. This paper proposes a method for estimating variance reduction during DNN training using only minibatches sampled under importance sampling. Building on this estimate, the paper also derives an effective minibatch size that enables automatic learning rate adjustment. An absolute metric to quantify the efficiency of importance sampling is also introduced, along with an algorithm for real-time estimation of importance scores based on moving gradient statistics. Theoretical analysis and experiments on benchmark datasets demonstrate that the proposed algorithm consistently reduces variance, improves training efficiency, and enhances model accuracy compared with current importance-sampling approaches while maintaining minimal computational overhead.
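The core recipe behind this line of work: draw minibatch indices from an importance distribution and reweight each sampled gradient by 1/(N p_i), which keeps the estimator unbiased while shrinking its variance. Below is a minimal numpy sketch on a toy least-squares problem; note that the paper's contribution is estimating the variance reduction from the sampled minibatches alone, whereas the sketch computes it exactly for illustration, and the gradient-norm scores are a common proxy rather than the paper's moving-statistics scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem so per-sample gradients are analytic.
N, d, B = 1000, 5, 32
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
w = np.zeros(d)

g = (X @ w - y)[:, None] * X     # per-sample gradients of 0.5*(x.w - y)^2
g_bar = g.mean(axis=0)           # full-batch gradient

# Importance scores: per-sample gradient norms (a common proxy; the paper
# instead maintains moving gradient statistics to obtain scores cheaply).
p = np.linalg.norm(g, axis=1) + 1e-12
p /= p.sum()

# Unbiased IS minibatch gradient: weight each draw by 1 / (N * p_i).
idx = rng.choice(N, size=B, p=p)
g_hat = (g[idx] / (N * p[idx])[:, None]).sum(axis=0) / B

# Per-draw variance (trace) of the IS estimator vs. uniform sampling.
sq = np.linalg.norm(g, axis=1) ** 2
var_is = sq @ (1.0 / p) / N**2 - np.linalg.norm(g_bar) ** 2
var_unif = sq.mean() - np.linalg.norm(g_bar) ** 2
print(f"variance reduction factor: {var_unif / var_is:.2f}x")
```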
Related papers
- Gradient Descent Efficiency Index [0.0]
This study introduces a new efficiency metric, Ek, designed to quantify the effectiveness of each iteration.
The proposed metric accounts for both the relative change in error and the stability of the loss function across iterations.
Ek has the potential to guide more informed decisions in the selection and tuning of optimization algorithms in machine learning applications.
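The abstract does not give the formula for Ek, so the sketch below uses a hypothetical form that merely combines the two stated ingredients, relative error change and loss stability across iterations; both the formula and the window parameter are assumptions.

```python
import numpy as np

def efficiency_index(losses, window=5):
    """Hypothetical stand-in for the paper's E_k: relative error change
    damped by recent loss volatility (the actual formula is not given)."""
    losses = np.asarray(losses, dtype=float)
    ek = []
    for k in range(1, len(losses)):
        rel_change = (losses[k - 1] - losses[k]) / max(losses[k - 1], 1e-12)
        recent = losses[max(0, k - window): k + 1]
        stability = 1.0 / (1.0 + np.std(recent))   # noisier loss -> lower E_k
        ek.append(rel_change * stability)
    return np.array(ek)

print(efficiency_index([1.0, 0.7, 0.55, 0.52, 0.51]))
```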
arXiv Detail & Related papers (2024-10-25T10:22:22Z)
- STATE: A Robust ATE Estimator of Heavy-Tailed Metrics for Variance Reduction in Online Controlled Experiments [22.32661807469984]
We develop a novel framework that integrates the Student's t-distribution with machine learning tools to fit heavy-tailed metrics.
By adopting a variational EM method to optimize the log-likelihood function, we can infer a robust solution that substantially reduces the negative impact of outliers.
Both simulations on synthetic data and long-term empirical results on Meituan experiment platform demonstrate the effectiveness of our method.
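The heavy-tailed ingredient here is standard: a Student's t-likelihood fitted by EM down-weights outliers automatically. Below is a minimal sketch of that core for a robust location estimate; the STATE framework layers machine-learning regressors and a variational EM on top of it.

```python
import numpy as np

def t_robust_mean(x, nu=3.0, iters=50):
    """Classic EM for the location of a Student-t model; points far from
    the bulk receive small latent-scale weights and barely move the mean."""
    mu, sigma2 = np.median(x), np.var(x)
    for _ in range(iters):
        r2 = (x - mu) ** 2 / sigma2
        w = (nu + 1.0) / (nu + r2)          # E-step: latent scale weights
        mu = np.sum(w * x) / np.sum(w)      # M-step: weighted location
        sigma2 = np.sum(w * (x - mu) ** 2) / len(x)
    return mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(40, 1, 5)])  # heavy tail
print(np.mean(x), t_robust_mean(x))   # plain mean is pulled; the t mean is not
```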
arXiv Detail & Related papers (2024-07-23T09:35:59Z)
- Efficient Backpropagation with Variance-Controlled Adaptive Sampling [32.297478086982466]
Sampling-based algorithms, which eliminate "unimportant" computations during forward and/or backward propagation (BP), offer potential solutions to accelerate neural network training.
We introduce a variance-controlled adaptive sampling (VCAS) method designed to accelerate BP.
VCAS can preserve the original training loss trajectory and validation accuracy with up to a 73.87% FLOPs reduction in BP and a 49.58% FLOPs reduction over the whole training process.
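A toy version of the variance-control loop: subsample per-sample gradients unbiasedly, measure the resulting estimator variance, and adapt the keep ratio toward a budget. VCAS itself does this per layer inside backpropagation; the names, thresholds, and update factors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def subsampled_grad(gs, keep_ratio):
    # Keep each per-sample gradient with probability keep_ratio and
    # rescale by 1/keep_ratio, so the estimate stays unbiased.
    mask = rng.random(len(gs)) < keep_ratio
    if not mask.any():
        return np.zeros_like(gs[0])
    return gs[mask].sum(axis=0) / (keep_ratio * len(gs))

gs = rng.normal(size=(256, 10))           # stand-in per-sample gradients
ratio, target_var = 0.5, 1e-3
ests = np.stack([subsampled_grad(gs, ratio) for _ in range(30)])
var = ests.var(axis=0).mean()             # measured estimator variance
# Control loop: grow the keep ratio when too noisy, shrink when safely below.
ratio = min(1.0, ratio * 1.2) if var > target_var else max(0.05, ratio / 1.2)
print(f"measured variance {var:.2e}, next keep ratio {ratio:.2f}")
```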
arXiv Detail & Related papers (2024-02-27T05:40:36Z)
- Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
We propose a practical algorithm that efficiently computes data importance on-the-fly during training.
We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling.
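For softmax cross-entropy, the derivative of the loss with respect to the network output is softmax(z) - onehot(y), so an output-gradient importance score costs essentially nothing beyond the forward pass. A sketch of that idea follows; the paper's exact metric may differ in detail.

```python
import numpy as np

def output_grad_scores(logits, labels):
    """Importance scores from the loss derivative w.r.t. the network
    output; for softmax cross-entropy this is softmax(z) - onehot(y)."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    p[np.arange(len(labels)), labels] -= 1.0         # dL/dlogits
    return np.linalg.norm(p, axis=1)

rng = np.random.default_rng(3)
logits = rng.normal(size=(8, 4))
labels = rng.integers(0, 4, size=8)
s = output_grad_scores(logits, labels)
print(s / s.sum())   # normalized sampling probabilities for later draws
```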
arXiv Detail & Related papers (2023-11-24T13:21:35Z)
- Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
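The paper's sketching data structures are more involved, but the one-pass, small-memory flavor can be illustrated with weighted reservoir sampling (Efraimidis-Spirakis A-Res), which draws items with probability proportional to an importance weight in a single pass using O(k) memory; it is a stand-in here, not the paper's construction.

```python
import heapq
import numpy as np

def weighted_reservoir(stream, k, rng):
    """One-pass, O(k)-memory sample of k items with probability
    proportional to weight (Efraimidis-Spirakis A-Res)."""
    heap = []  # min-heap of (key, item); keep the k largest keys
    for item, w in stream:
        key = rng.random() ** (1.0 / max(w, 1e-12))
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

rng = np.random.default_rng(4)
stream = ((i, (i % 10) + 1.0) for i in range(10_000))  # weight = importance
print(weighted_reservoir(stream, 5, rng))
```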
arXiv Detail & Related papers (2022-07-16T03:09:30Z)
- Sampling-free Variational Inference for Neural Networks with Multiplicative Activation Noise [51.080620762639434]
We propose a more efficient parameterization of the posterior approximation for sampling-free variational inference.
Our approach yields competitive results for standard regression problems and scales well to large-scale image classification tasks.
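The sampling-free idea is to propagate distribution moments analytically instead of drawing noise samples. A minimal sketch for one linear layer under multiplicative Gaussian activation noise eps ~ N(1, alpha), assuming independent activations; the paper's posterior parameterization is richer than this.

```python
import numpy as np

def linear_moments(a_mean, a_var, W, b):
    """Propagate mean and variance of independent activations through a
    linear layer in closed form -- the 'sampling-free' ingredient."""
    m = W @ a_mean + b
    v = (W ** 2) @ a_var   # Var(sum w_j a_j) = sum w_j^2 Var(a_j)
    return m, v

rng = np.random.default_rng(5)
a = rng.normal(size=16)
alpha = 0.1                       # multiplicative noise eps ~ N(1, alpha)
a_mean, a_var = a, alpha * a**2   # moments of a * eps
W, b = rng.normal(size=(4, 16)) / 4, np.zeros(4)
print(linear_moments(a_mean, a_var, W, b))
```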
arXiv Detail & Related papers (2021-03-15T16:16:18Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
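The modification amounts to a per-sample weight on the minibatch gradient. Below is a sketch that uses a softmax over minibatch losses as the weighting; the temperature and the mean-one normalization are illustrative choices, not necessarily the paper's exact scheme.

```python
import numpy as np

def absgd_weights(losses, lam=1.0):
    """Per-sample weights from a softmax over minibatch losses: higher
    loss -> higher weight (useful under class imbalance). A negative lam
    would instead down-weight large losses, the label-noise regime."""
    z = np.asarray(losses) / lam
    z -= z.max()                        # numerical stability
    w = np.exp(z)
    return w * len(losses) / w.sum()    # mean weight 1 keeps the step scale

losses = np.array([0.1, 0.2, 2.5, 0.15])   # one hard or noisy-label sample
print(absgd_weights(losses, lam=1.0))
# weighted gradient: g = mean_i w_i * g_i, then the usual momentum update
```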
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of training speed and test error.
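A hypothetical sketch of the rule as described in the snippet: keep a moving history of each parameter's gradient direction and scale that parameter's step up when the current gradient agrees with it. The constants, clipping, and exact update form are assumptions, not the paper's.

```python
import numpy as np

def adarem_step(w, g, m, lr_scale, base_lr=0.01, beta=0.9, eta=0.02):
    """Hypothetical AdaRem-style step: per-parameter learning-rate scales
    grow on sign agreement with the gradient history, shrink on conflict."""
    m = beta * m + (1 - beta) * g        # moving direction history
    agree = np.sign(m) * np.sign(g)      # +1 aligned, -1 opposed, 0 neutral
    lr_scale = np.clip(lr_scale * (1 + eta * agree), 0.1, 10.0)
    w = w - base_lr * lr_scale * g
    return w, m, lr_scale

w, m, s = np.zeros(3), np.zeros(3), np.ones(3)
for g in (np.array([1.0, -1.0, 1.0]), np.array([1.0, 1.0, -1.0])):
    w, m, s = adarem_step(w, g, m, s)
print(s)   # scales moved per parameter according to direction agreement
```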
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- Bandit Samplers for Training Graph Neural Networks [63.17765191700203]
Several sampling algorithms with variance reduction have been proposed for accelerating the training of Graph Convolutional Networks (GCNs).
These sampling algorithms are not applicable to more general graph neural networks (GNNs) in which the message aggregator contains learned rather than fixed weights, such as Graph Attention Networks (GAT).
arXiv Detail & Related papers (2020-06-10T12:48:37Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of such training variants can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- A Dynamic Sampling Adaptive-SGD Method for Machine Learning [8.173034693197351]
We propose a method that adaptively controls the batch size used in the computation of gradient approximations and the step size used to move along such directions.
The proposed method exploits local curvature information and ensures that search directions are descent directions with high probability.
Numerical experiments show that this method is able to choose the best learning rates and compares favorably to fine-tuned SGD for training logistic regression and DNNs.
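The batch-size control can be illustrated with a norm-test-style rule: grow the batch when the sampled gradient's variance is large relative to its squared norm. The paper's test additionally exploits local curvature information; the rule below is a simplified stand-in with assumed constants.

```python
import numpy as np

def next_batch_size(per_sample_grads, B, theta=0.5, B_max=4096):
    """Norm-test flavor of adaptive batching: double the batch when the
    gradient estimate's variance exceeds theta^2 times its squared norm."""
    g = per_sample_grads.mean(axis=0)
    var = per_sample_grads.var(axis=0).sum() / len(per_sample_grads)
    if var > (theta ** 2) * np.dot(g, g):
        B = min(2 * B, B_max)
    return B

rng = np.random.default_rng(6)
noisy = rng.normal(loc=0.05, scale=1.0, size=(64, 10))  # high-noise regime
print(next_batch_size(noisy, B=64))   # batch doubles in the noisy regime
```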
arXiv Detail & Related papers (2019-12-31T15:36:44Z)