Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
- URL: http://arxiv.org/abs/2407.07613v2
- Date: Mon, 19 May 2025 19:25:39 GMT
- Title: Randomness Helps Rigor: A Probabilistic Learning Rate Scheduler Bridging Theory and Deep Learning Practice
- Authors: Dahlia Devapriya, Thulasi Tholeti, Janani Suresh, Sheetal Kalyani
- Abstract summary: We propose a probabilistic learning rate scheduler (PLRS). PLRS does not conform to the monotonically decreasing condition, yet it comes with provable convergence guarantees. We show that PLRS performs as well as or better than existing state-of-the-art learning rate schedulers in terms of both convergence and accuracy.
- Score: 7.494722456816369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning rate schedulers have shown great success in speeding up the convergence of learning algorithms in practice. However, their convergence to a minimum has not been proven theoretically. This difficulty mainly arises from the fact that, while traditional convergence analysis prescribes to monotonically decreasing (or constant) learning rates, schedulers opt for rates that often increase and decrease through the training epochs. In this work, we aim to bridge the gap by proposing a probabilistic learning rate scheduler (PLRS) that does not conform to the monotonically decreasing condition, with provable convergence guarantees. To cement the relevance and utility of our work in modern day applications, we show experimental results on deep neural network architectures such as ResNet, WRN, VGG, and DenseNet on CIFAR-10, CIFAR-100, and Tiny ImageNet datasets. We show that PLRS performs as well as or better than existing state-of-the-art learning rate schedulers in terms of convergence as well as accuracy. For example, while training ResNet-110 on the CIFAR-100 dataset, we outperform the state-of-the-art knee scheduler by $1.56\%$ in terms of classification accuracy. Furthermore, on the Tiny ImageNet dataset using ResNet-50 architecture, we show a significantly more stable convergence than the cosine scheduler and a better classification accuracy than the existing schedulers.
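To make the mechanism concrete, below is a minimal sketch of a probabilistic, non-monotone learning rate schedule in PyTorch. It is not the paper's construction: the class name `ProbabilisticLR`, the uniform sampling band, and the `decay`/`spread` values are illustrative assumptions. The only property it tries to capture is the one the abstract highlights, namely that the per-epoch rate is random and may increase between epochs while its envelope still shrinks.

```python
# Illustrative sketch only: samples each epoch's learning rate from a band
# around a decaying envelope, so the rate is non-monotone across epochs.
import random
import torch

class ProbabilisticLR:
    """Hypothetical PLRS-style scheduler: lr ~ Uniform[(1 - s) * base, (1 + s) * base]."""

    def __init__(self, optimizer, base_lr=0.1, decay=0.97, spread=0.5, seed=0):
        self.optimizer = optimizer
        self.base_lr = base_lr    # centre of the decaying envelope
        self.decay = decay        # per-epoch multiplicative decay of the envelope
        self.spread = spread      # relative half-width of the sampling band
        self.rng = random.Random(seed)

    def step(self):
        low = self.base_lr * (1.0 - self.spread)
        high = self.base_lr * (1.0 + self.spread)
        lr = self.rng.uniform(low, high)   # may exceed the previous epoch's rate
        for group in self.optimizer.param_groups:
            group["lr"] = lr
        self.base_lr *= self.decay         # envelope shrinks over training
        return lr

# Usage with any torch optimizer:
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = ProbabilisticLR(optimizer)
for epoch in range(5):
    # ... run one training epoch here ...
    print(epoch, scheduler.step())
```

The paper specifies the actual sampling distribution and the conditions under which such a randomized schedule provably converges; the sketch only shows how a scheduler of this kind plugs into a standard training loop.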
Related papers
- Orthogonal Soft Pruning for Efficient Class Unlearning [26.76186024947296]
We propose a class-aware soft pruning framework to achieve rapid and precise forgetting with millisecond-level response times. Our method decorrelates convolutional filters and disentangles feature representations, while efficiently identifying class-specific channels.
arXiv Detail & Related papers (2025-06-24T09:52:04Z) - When Simple Model Just Works: Is Network Traffic Classification in Crisis? [0.0]
We show that a simple k-NN baseline using packet sequence metadata can be on par with or even outperform more complex methods. We argue that standard machine learning practices adapted from domains like NLP or computer vision may be ill-suited for network traffic classification.
arXiv Detail & Related papers (2025-06-10T10:11:05Z) - MAB-Based Channel Scheduling for Asynchronous Federated Learning in Non-Stationary Environments [12.404264058659429]
Federated learning enables distributed model training across clients without raw data exchange.
In wireless implementations, frequent parameter updates cause high communication overhead.
We propose an asynchronous federated learning scheduling framework to reduce client staleness while enhancing communication efficiency and fairness.
arXiv Detail & Related papers (2025-03-03T09:05:04Z) - Provable Contrastive Continual Learning [7.6989463205452555]
We establish theoretical performance guarantees, which reveal how the performance of the model is bounded by training losses of previous tasks.
Inspired by our theoretical analysis of these guarantees, we propose a novel contrastive continual learning algorithm called CILA.
Our method shows great improvement on standard benchmarks and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2024-05-29T04:48:11Z) - Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch [72.26822499434446]
Auto-Train-Once (ATO) is an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs.
We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures.
arXiv Detail & Related papers (2024-03-21T02:33:37Z) - Relaxed Contrastive Learning for Federated Learning [48.96253206661268]
We propose a novel contrastive learning framework to address the challenges of data heterogeneity in federated learning.
Our framework outperforms all existing federated learning approaches by huge margins on the standard benchmarks.
arXiv Detail & Related papers (2024-01-10T04:55:24Z) - Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - Uncertainty quantification for learned ISTA [5.706217259840463]
Algorithm unrolling schemes stand out among model-based learning techniques.
However, they lack certainty estimates, and a theory for uncertainty quantification is still elusive.
This work proposes a rigorous way to obtain confidence intervals for the LISTA estimator.
arXiv Detail & Related papers (2023-09-14T18:39:07Z) - Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level
Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z) - Unbiased and Efficient Self-Supervised Incremental Contrastive Learning [31.763904668737304]
We propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of a novel Incremental InfoNCE (NCE-II) loss function.
ICL achieves up to 16.7x training speedup and 16.8x faster convergence with competitive results.
arXiv Detail & Related papers (2023-01-28T06:11:31Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Training Spiking Neural Networks with Local Tandem Learning [96.32026780517097]
Spiking neural networks (SNNs) are shown to be more biologically plausible and energy efficient than their predecessors.
In this paper, we put forward a generalized learning rule, termed Local Tandem Learning (LTL).
We demonstrate rapid network convergence within five training epochs on the CIFAR-10 dataset while having low computational complexity.
arXiv Detail & Related papers (2022-10-10T10:05:00Z) - MaxMatch: Semi-Supervised Learning with Worst-Case Consistency [149.03760479533855]
We propose a worst-case consistency regularization technique for semi-supervised learning (SSL).
We present a generalization bound for SSL consisting of the empirical loss terms observed on labeled and unlabeled training data separately.
Motivated by this bound, we derive an SSL objective that minimizes the largest inconsistency between an original unlabeled sample and its multiple augmented variants.
arXiv Detail & Related papers (2022-09-26T12:04:49Z) - Learning Rate Curriculum [75.98230528486401]
We propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC).
LeRaC uses a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs.
We compare our approach with Curriculum by Smoothing (CBS), a state-of-the-art data-agnostic curriculum learning approach.
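Since LeRaC's mechanism (one learning rate per layer, ramped during the initial epochs) is closely related to the scheduling theme of this paper, a small illustration may help. The sketch below is not the authors' code: it simply uses PyTorch parameter groups, with placeholder initial per-layer rates and a placeholder geometric ramp toward a common target rate.

```python
# Illustrative LeRaC-style per-layer learning rates via parameter groups.
import torch
import torch.nn as nn

# Toy network; in practice this would be a deep architecture such as a ResNet.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

target_lr = 0.1
layers = [m for m in model if any(True for _ in m.parameters())]
# Placeholder assumption: layers farther from the input start from smaller rates.
init_lrs = [target_lr * (0.1 ** i) for i in range(len(layers))]

optimizer = torch.optim.SGD(
    [{"params": m.parameters(), "lr": lr} for m, lr in zip(layers, init_lrs)],
    momentum=0.9,
)

curriculum_epochs = 5
for epoch in range(curriculum_epochs):
    # ... train one epoch with the current per-layer rates ...
    # Ramp every group's rate geometrically toward the shared target rate.
    t = (epoch + 1) / curriculum_epochs
    for group, lr0 in zip(optimizer.param_groups, init_lrs):
        group["lr"] = lr0 * (target_lr / lr0) ** t
# After the curriculum phase, all parameter groups train at target_lr.
```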
arXiv Detail & Related papers (2022-05-18T18:57:36Z) - Revisiting Consistency Regularization for Semi-Supervised Learning [80.28461584135967]
We propose an improved consistency regularization framework by a simple yet effective technique, FeatDistLoss.
Experimental results show that our model defines a new state of the art for various datasets and settings.
arXiv Detail & Related papers (2021-12-10T20:46:13Z) - Distributed Learning and its Application for Time-Series Prediction [0.0]
Extreme events are occurrences whose magnitude and impact can cause extensive damage to people, infrastructure, and the environment.
Motivated by the extreme nature of the current global health landscape, which is plagued by the coronavirus pandemic, we seek to better understand and model extreme events.
arXiv Detail & Related papers (2021-06-06T18:57:30Z) - LRTuner: A Learning Rate Tuner for Deep Neural Networks [10.913790890826785]
The choice of learning rate schedule determines the computational cost of getting close to a minimum, how close you actually get to it, and, most importantly, the kind of local minimum (wide or narrow) attained.
Current systems employ hand tuned learning rate schedules, which are painstakingly tuned for each network and dataset.
We present LRTuner, a method for tuning the learning rate schedule as training proceeds.
arXiv Detail & Related papers (2021-05-30T13:06:26Z) - Contrastive learning of strong-mixing continuous-time stochastic
processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Convolutional Neural Network Training with Distributed K-FAC [14.2773046188145]
Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix.
We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale.
arXiv Detail & Related papers (2020-07-01T22:00:53Z) - Identifying and Compensating for Feature Deviation in Imbalanced Deep
Learning [59.65752299209042]
We investigate learning a ConvNet under such a class-imbalanced scenario.
We found that a ConvNet significantly over-fits the minor classes.
We propose to incorporate class-dependent temperatures (CDT) when training the ConvNet.
arXiv Detail & Related papers (2020-01-06T03:52:11Z)