Feature Whitening via Gradient Transformation for Improved Convergence
- URL: http://arxiv.org/abs/2010.01546v2
- Date: Sun, 8 Nov 2020 08:35:17 GMT
- Title: Feature Whitening via Gradient Transformation for Improved Convergence
- Authors: Shmulik Markovich-Golan, Barak Battash, Amit Bleiweiss
- Abstract summary: We address the complexity drawbacks of feature whitening.
We derive an equivalent method, which replaces the sample transformations by a transformation to the weight gradients, applied to every batch of B samples.
We exemplify the proposed algorithms with ResNet-based networks for image classification demonstrated on the CIFAR and Imagenet datasets.
- Score: 3.5579740292581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Feature whitening is a known technique for speeding up the training of DNNs. Under
certain assumptions, whitening the activations reduces the Fisher information
matrix to a simple identity matrix, in which case stochastic gradient descent
is equivalent to the faster natural gradient descent. Due to the additional
complexity resulting from transforming the layer inputs and their corresponding
gradients in the forward and backward propagation, and from repeatedly
computing the Eigenvalue decomposition (EVD), this method is not commonly used
to date. In this work, we address the complexity drawbacks of feature
whitening. Our contribution is twofold. First, we derive an equivalent method,
which replaces the sample transformations by a transformation to the weight
gradients, applied to every batch of B samples. The complexity is reduced by a
factor of S/(2B), where S denotes the feature dimension of the layer output. As
the batch size increases with distributed training, the benefit of using the
proposed method becomes more compelling. Second, motivated by the theoretical
relation between the condition number of the sample covariance matrix and the
convergence speed, we derive an alternative sub-optimal algorithm which
recursively reduces the condition number of the latter matrix. Compared to EVD,
complexity is reduced by a factor of the input feature dimension M. We
exemplify the proposed algorithms with ResNet-based networks for image
classification demonstrated on the CIFAR and ImageNet datasets. Parallelizing
the proposed algorithms is straightforward and we implement a distributed
version thereof. Improved convergence, in terms of speed and attained accuracy,
can be observed in our experiments.
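To make the gradient-side formulation concrete, below is a minimal PyTorch sketch, not the paper's exact algorithm: for a single linear layer it estimates the per-batch covariance of the layer inputs and preconditions the weight gradient with its inverse, so the whitening effect is applied once per batch of B samples instead of transforming every sample in the forward and backward passes. The helper names (`input_covariance`, `whiten_gradient`) and the regularizer `eps` are illustrative assumptions, not taken from the paper.
```python
# Minimal sketch (assumed layout): precondition the weight gradient of one
# linear layer with the inverse of the per-batch input covariance, instead of
# whitening the activations themselves.
import torch

def input_covariance(x, eps=1e-5):
    """Sample covariance of one batch of layer inputs x with shape (B, M)."""
    B, M = x.shape
    xc = x - x.mean(dim=0, keepdim=True)
    C = xc.t() @ xc / B                        # (M, M) sample covariance
    return C + eps * torch.eye(M)              # regularize for invertibility

def whiten_gradient(grad_W, C):
    """Return grad_W @ C^{-1}: a per-batch transform applied to the weight
    gradient (grad_W has shape (S, M), matching a Linear layer's weight)."""
    return torch.linalg.solve(C, grad_W.t()).t()

# Illustrative training step with batch size B = 32:
layer = torch.nn.Linear(128, 64, bias=False)   # M = 128 inputs, S = 64 outputs
x = torch.randn(32, 128)
loss = layer(x).pow(2).mean()                  # placeholder loss
loss.backward()
with torch.no_grad():
    C = input_covariance(x)
    layer.weight.grad = whiten_gradient(layer.weight.grad, C)
    # ...the preconditioned gradient is then handed to the optimizer step.
```
Using C^{-1} as the preconditioner is one common convention for this kind of input-side preconditioning; the paper derives the transform that is exactly equivalent to feature whitening, together with the recursive alternative that avoids repeated EVD.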
Related papers
- Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit gradient descent (FO-SGD), relies on two simple algorithmic ideas.
arXiv Detail & Related papers (2024-05-17T21:17:27Z)
- Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo [30.4930148381328]
Diffusion-based Monte Carlo (DMC) is a method to sample from a general target distribution beyond the isoperimetric condition.
DMC, however, encounters high gradient complexity, resulting in an exponential dependency on the error tolerance $\epsilon$ of the obtained samples.
We propose RS-DMC, based on a novel recursion-based score estimation method.
Our algorithm is provably much faster than the popular Langevin-based algorithms.
arXiv Detail & Related papers (2024-01-12T02:33:57Z)
- Stochastic Optimization for Non-convex Problem with Inexact Hessian Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously provide inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z)
- Optimal Input Gain: All You Need to Supercharge a Feed-Forward Neural Network [0.6562256987706128]
It is shown that pre-processing inputs with a linear transformation is equivalent to multiplying the negative gradient matrix by an autocorrelation matrix per training iteration.
It is shown that the proposed OIG-improved HWO could be a significant building block for more complex deep learning architectures.
arXiv Detail & Related papers (2023-03-30T22:20:16Z)
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
- Communication-Efficient Federated Learning via Quantized Compressed Sensing [82.10695943017907]
The presented framework consists of gradient compression for wireless devices and gradient reconstruction for a parameter server.
Thanks to gradient sparsification and quantization, our strategy can achieve a higher compression ratio than one-bit gradient compression.
We demonstrate that the framework achieves almost identical performance with the case that performs no compression.
arXiv Detail & Related papers (2021-11-30T02:13:54Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve for one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: The random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)