RNN Training along Locally Optimal Trajectories via Frank-Wolfe
Algorithm
- URL: http://arxiv.org/abs/2010.05397v3
- Date: Thu, 15 Oct 2020 16:02:28 GMT
- Title: RNN Training along Locally Optimal Trajectories via Frank-Wolfe
Algorithm
- Authors: Yun Yue, Ming Li, Venkatesh Saligrama, Ziming Zhang
- Abstract summary: We propose a novel and efficient training method for RNNs by iteratively seeking a local minimum on the loss surface within a small region.
We develop a novel RNN training method whose overall training cost, surprisingly, is empirically observed to be lower than that of back-propagation, even with the additional cost.
- Score: 50.76576946099215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel and efficient training method for RNNs by iteratively
seeking a local minimum on the loss surface within a small region, and leverage
this directional vector for the update in an outer loop. We propose to utilize
the Frank-Wolfe (FW) algorithm in this context. Although FW implicitly
involves normalized gradients, which can lead to a slow convergence rate, we
develop a novel RNN training method whose overall training cost, surprisingly,
is empirically observed to be lower than that of back-propagation, even with the
additional cost. Our method leads to a new Frank-Wolfe method that is in
essence an SGD algorithm with a restart scheme. We prove that under certain
conditions our algorithm has a sublinear convergence rate of $O(1/\epsilon)$
for $\epsilon$ error. We then conduct empirical experiments on several
benchmark datasets including those that exhibit long-term dependencies, and
show significant performance improvement. We also experiment with deep RNN
architectures and show efficient training performance. Finally, we demonstrate
that our training method is robust to noisy data.
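The idea in the abstract (Frank-Wolfe steps inside a small region, then a restart from the point found) can be illustrated with a minimal sketch. This is not the authors' algorithm or code: the toy quadratic objective, the L2-ball constraint, the 2/(k+2) step size, and all function names and parameter values (`radius`, `inner_steps`, `outer_steps`) are assumptions made for illustration only. It does show why FW over a ball implicitly uses normalized gradients: the linear minimization step depends only on the gradient's direction.

```python
import math

def loss(w, target):
    # Toy objective f(w) = 0.5 * ||w - target||^2 (an assumption, not an RNN loss).
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))

def grad(w, target):
    # Gradient of the toy objective above.
    return [wi - ti for wi, ti in zip(w, target)]

def lmo_l2_ball(center, g, radius):
    # Linear minimization oracle over {s : ||s - center|| <= radius}.
    # The minimizer of <g, s> is center - radius * g / ||g||,
    # so only the normalized gradient direction matters.
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(center)
    return [ci - radius * gi / norm for ci, gi in zip(center, g)]

def frank_wolfe_local(center, target, radius, inner_steps):
    # Inner loop: Frank-Wolfe restricted to a small ball around `center`.
    w = list(center)
    for k in range(inner_steps):
        s = lmo_l2_ball(center, grad(w, target), radius)
        gamma = 2.0 / (k + 2.0)  # classic FW step size
        w = [(1.0 - gamma) * wi + gamma * si for wi, si in zip(w, s)]
    return w

def train(w0, target, radius=0.5, inner_steps=10, outer_steps=50):
    # Outer loop: restart the local search from the point just found.
    w = list(w0)
    for _ in range(outer_steps):
        w = frank_wolfe_local(w, target, radius, inner_steps)
    return w

if __name__ == "__main__":
    start, goal = [5.0, -3.0], [1.0, 2.0]
    final = train(start, goal)
    print(loss(start, goal), loss(final, goal))
```

On this toy problem each outer step moves the iterate at most `radius` toward the minimizer, so the region size plays a role loosely similar to a step length; how the paper chooses the region and the inner solver is not specified in the abstract.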
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z) - Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce Stochastic UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z) - Improving Representational Continuity via Continued Pretraining [76.29171039601948]
A method from the transfer learning community (LP-FT) outperforms naive training and other continual learning methods.
LP-FT also reduces forgetting in a real-world satellite remote sensing dataset (FMoW).
A variant of LP-FT gets state-of-the-art accuracies on an NLP continual learning benchmark.
arXiv Detail & Related papers (2023-02-26T10:39:38Z) - Using Taylor-Approximated Gradients to Improve the Frank-Wolfe Method
for Empirical Risk Minimization [1.4504054468850665]
In the setting of Empirical Risk Minimization, we present a novel computational step-size approach for which we have computational guarantees.
We show that our methods exhibit very significant speed-ups on real-world binary classification datasets.
We also present a novel adaptive step-size approach for which we have computational guarantees.
arXiv Detail & Related papers (2022-08-30T00:08:37Z) - DNNR: Differential Nearest Neighbors Regression [8.667550264279166]
K-nearest neighbors (KNN) is one of the earliest and most established algorithms in machine learning.
For regression tasks, KNN averages the targets within a neighborhood which poses a number of challenges.
We propose Differential Nearest Neighbors Regression (DNNR) that addresses both issues simultaneously.
arXiv Detail & Related papers (2022-05-17T15:22:53Z) - AdaSTE: An Adaptive Straight-Through Estimator to Train Binary Neural
Networks [34.263013539187355]
We propose a new algorithm for training deep neural networks (DNNs) with binary weights.
Experimental results demonstrate that our new algorithm offers favorable performance compared to existing approaches.
arXiv Detail & Related papers (2021-12-06T09:12:15Z) - Efficient Neural Network Training via Forward and Backward Propagation
Sparsification [26.301103403328312]
We propose an efficient sparse training method with completely sparse forward and backward passes.
Our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.
arXiv Detail & Related papers (2021-11-10T13:49:47Z) - Regularized Frank-Wolfe for Dense CRFs: Generalizing Mean Field and
Beyond [19.544213396776268]
We introduce regularized Frank-Wolfe, a general and effective algorithm for inference in dense conditional random fields (CRFs).
We show that our new algorithms yield significant improvements over strong neural network baselines.
arXiv Detail & Related papers (2021-10-27T20:44:47Z) - Local Critic Training for Model-Parallel Learning of Deep Neural
Networks [94.69202357137452]
We propose a novel model-parallel learning method, called local critic training.
We show that the proposed approach successfully decouples the update process of the layer groups for both convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
We also show that networks trained by the proposed method can be used for structural optimization.
arXiv Detail & Related papers (2021-02-03T09:30:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.