Towards Understanding Label Smoothing
- URL: http://arxiv.org/abs/2006.11653v2
- Date: Sat, 3 Oct 2020 03:05:47 GMT
- Title: Towards Understanding Label Smoothing
- Authors: Yi Xu, Yuanhong Xu, Qi Qian, Hao Li, Rong Jin
- Abstract summary: Label smoothing regularization (LSR) has achieved great success in training deep neural networks with stochastic algorithms.
We show that an appropriate LSR can help speed up convergence by reducing the variance.
We propose a simple yet effective strategy, the Two-Stage LAbel smoothing algorithm (TSLA), which uses LSR in the early training epochs and drops it in the later ones.
- Score: 36.54164997035046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Label smoothing regularization (LSR) has achieved great success in training deep
neural networks by stochastic algorithms such as stochastic gradient descent
and its variants. However, theoretical understanding of its power from the
viewpoint of optimization is still limited. This study opens the door to a deep
understanding of LSR by initiating the analysis. In this paper, we analyze the
convergence behavior of stochastic gradient descent with label smoothing
regularization for solving non-convex problems and show that an appropriate LSR
can help to speed up convergence by reducing the variance. More
interestingly, we propose a simple yet effective strategy, namely the Two-Stage
LAbel smoothing algorithm (TSLA), that uses LSR in the early training epochs
and drops it in the later training epochs. We observe from the improved
convergence result of TSLA that it benefits from LSR in the first stage and
essentially converges faster in the second stage. To the best of our knowledge,
this is the first work to explain the power of LSR by establishing the
convergence complexity of stochastic methods with LSR in non-convex
optimization. We empirically demonstrate the effectiveness of the proposed
method in comparison with baselines on training ResNet models over benchmark
data sets.
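
The two-stage idea in the abstract can be illustrated with a minimal sketch. It assumes the common form of label smoothing, in which a one-hot label is mixed with the uniform distribution over classes using a weight `theta`. The function names (`smooth_labels`, `cross_entropy`, `tsla_theta`), the switch epoch, and the value `theta=0.1` are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def smooth_labels(one_hot, theta):
    """Mix a one-hot label with the uniform distribution over K classes.

    Assumed form: (1 - theta) * y + theta / K for each class.
    """
    k = len(one_hot)
    return [(1.0 - theta) * y + theta / k for y in one_hot]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target, logits))


def tsla_theta(epoch, switch_epoch, theta):
    """Two-stage schedule: smooth before switch_epoch, plain labels after."""
    return theta if epoch < switch_epoch else 0.0


# Toy example: one sample with 4 classes and fixed logits (illustrative values).
one_hot = [0.0, 1.0, 0.0, 0.0]
logits = [0.5, 2.0, -1.0, 0.1]
for epoch in range(4):
    theta = tsla_theta(epoch, switch_epoch=2, theta=0.1)
    target = smooth_labels(one_hot, theta)
    print(epoch, theta, round(cross_entropy(target, logits), 4))
```

In a real training loop the loss would be averaged over mini-batches and the switch epoch would be a tuned hyperparameter; the sketch only shows how the target distribution changes between the two stages.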
 
      
Related papers
- A Trainable Optimizer [18.195022468462753]
 We present a framework that jointly trains the full gradient estimator and the trainable weights of the model.
Pseudo-linear TO incurs negligible computational overhead, requiring only minimal additional multiplications.
Experiments demonstrate that TO methods converge faster than benchmark algorithms.
 arXiv  Detail & Related papers  (2025-08-03T14:06:07Z)
- Beyond First-Order: Training LLMs with Stochastic Conjugate Subgradients and AdamW [2.028622227373579]
 Stochastic gradient-based descent (SGD) methods have long been central to training large language models (LLMs).
This paper proposes a conjugate subgradient method together with adaptive sampling specifically for training LLMs.
 arXiv  Detail & Related papers  (2025-07-01T23:30:15Z)
- A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training [3.246129789918632]
 The stochastic gradient descent (SGD) algorithm has achieved remarkable success in training deep learning models.
Alternating minimization (AM) methods have emerged as a promising alternative for model training.
We propose a novel Triple-Inertial Accelerated Alternating Minimization (TIAM) framework for neural network training.
 arXiv  Detail & Related papers  (2025-03-11T14:42:17Z)
- Learning Provably Improves the Convergence of Gradient Descent [9.82454981262489]
 We study the convergence of Learning to Optimize (L2O) problems by training-based solvers.
An algorithm's tangent significantly enhances L2O's convergence.
Our findings indicate 50% outperformance over the GD methods.
 arXiv  Detail & Related papers  (2025-01-30T02:03:30Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
 We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
 arXiv  Detail & Related papers  (2024-03-11T09:10:37Z)
- OptEx: Expediting First-Order Optimization with Approximately Parallelized Iterations [12.696136981847438]
 We introduce first-order optimization expedited with approximately parallelized iterations (OptEx)
OptEx is the first framework that enhances the efficiency of FOO by leveraging parallel computing to mitigate its iterative bottleneck.
We provide theoretical guarantees for the reliability of our kernelized gradient estimation and the complexity of SGD-based OptEx.
 arXiv  Detail & Related papers  (2024-02-18T02:19:02Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
 This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
 arXiv  Detail & Related papers  (2023-10-20T12:45:12Z)
- Stochastic Unrolled Federated Learning [85.6993263983062]
 We introduce Stochastic UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
 arXiv  Detail & Related papers  (2023-05-24T17:26:22Z)
- Loop Unrolled Shallow Equilibrium Regularizer (LUSER) -- A Memory-Efficient Inverse Problem Solver [26.87738024952936]
 In inverse problems we aim to reconstruct some underlying signal of interest from potentially corrupted and often ill-posed measurements.
We propose a loop unrolled (LU) algorithm with shallow equilibrium regularizers (LUSER).
These implicit models are as expressive as deeper convolutional networks, but far more memory efficient during training.
 arXiv  Detail & Related papers  (2022-10-10T19:50:37Z)
- Learning Neural Network Quantum States with the Linear Method [0.0]
 We show that the linear method can be used successfully for the optimization of complex valued neural network quantum states.
We compare the LM to the state-of-the-art SR algorithm and find that the LM requires up to an order of magnitude fewer iterations for convergence.
 arXiv  Detail & Related papers  (2021-04-22T12:18:33Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
 We propose a new reinforcement learning based ZO algorithm (ZO-RL) with learning the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that our ZO-RL algorithm can effectively reduce the variances of ZO gradient by learning a sampling policy, and converge faster than existing ZO algorithms in different scenarios.
 arXiv  Detail & Related papers  (2021-04-09T14:50:59Z)
- Neurally Augmented ALISTA [15.021419552695066]
 We introduce Neurally Augmented ALISTA, in which an LSTM network is used to compute step sizes and thresholds individually for each target vector during reconstruction.
We show that our approach further improves empirical performance in sparse reconstruction, in particular outperforming existing algorithms by an increasing margin as the compression ratio becomes more challenging.
 arXiv  Detail & Related papers  (2020-10-05T11:39:49Z)
- Regularized linear autoencoders recover the principal components, eventually [15.090789983727335]
 We show that when trained with proper regularization, linear autoencoders can learn the optimal representation.
We show that this convergence is slow due to ill-conditioning that worsens with increasing latent dimension.
We present a simple modification to the gradient descent update that greatly speeds up convergence empirically.
 arXiv  Detail & Related papers  (2020-07-13T23:08:25Z)
- Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters [152.03852111442114]
 Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational complexity.
Our paper shows that such complexity can significantly affect the overall convergence performance of ANIL.
 arXiv  Detail & Related papers  (2020-06-16T19:57:48Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
 We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad non- neural class functions.
 arXiv  Detail & Related papers  (2020-04-15T09:52:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     