Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates
- URL: http://arxiv.org/abs/2210.08494v2
- Date: Tue, 12 Sep 2023 16:41:48 GMT
- Title: Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates
- Authors: Constantin Octavian Puiu
- Abstract summary: We exploit the exponential-average construction paradigm of the K-factors and use online numerical linear algebra techniques to estimate the K-factor inverses more cheaply.
We propose a K-factor inverse update which scales linearly in layer size.
We also propose an inverse application procedure which scales linearly as well.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: K-FAC (arXiv:1503.05671, arXiv:1602.01407) is a tractable implementation of
Natural Gradient (NG) for Deep Learning (DL), whose bottleneck is computing the
inverses of the so-called "Kronecker-Factors" (K-factors). RS-KFAC
(arXiv:2206.15397) is a K-FAC improvement which provides a cheap way of
estimating the K-factors inverses.
In this paper, we exploit the exponential-average construction paradigm of
the K-factors, and use online numerical linear algebra techniques to propose an
even cheaper (but less accurate) way of estimating the K-factors inverses. In
particular, we propose a K-factor inverse update which scales linearly in layer
size. We also propose an inverse application procedure which scales linearly as
well (the one of K-FAC scales cubically and the one of RS-KFAC scales
quadratically). Overall, our proposed algorithm gives an approximate K-FAC
implementation whose preconditioning part scales linearly in layer size
(compared to cubic for K-FAC and quadratic for RS-KFAC). Importantly, however,
this update is only applicable in some circumstances (typically for all FC
layers), unlike the RS-KFAC approach (arXiv:2206.15397).
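For concreteness, here is a minimal NumPy sketch of the quantities the abstract refers to: the exponential-average construction of the K-factors for a fully-connected layer, and the explicit inversions that make vanilla K-FAC's preconditioning cubic in layer size. The layer sizes, decay rate and damping value are illustrative assumptions, and the code shows vanilla K-FAC only, not the linear-scaling update proposed in the paper.

```python
import numpy as np

rho = 0.95                      # EMA decay for the K-factors (illustrative value)
d_in, d_out, m = 512, 256, 128  # hypothetical layer sizes and batch size

# Running K-factors of one fully-connected layer (standard K-FAC quantities):
# A tracks E[a a^T] over layer inputs a, G tracks E[g g^T] over pre-activation
# gradients g; both are maintained as exponential averages.
A = np.eye(d_in)
G = np.eye(d_out)

def update_k_factors(A, G, a_batch, g_batch, rho=0.95):
    """Exponential-average construction of the K-factors.
    a_batch: (m, d_in) layer inputs, g_batch: (m, d_out) backprop signals."""
    A_hat = a_batch.T @ a_batch / a_batch.shape[0]
    G_hat = g_batch.T @ g_batch / g_batch.shape[0]
    return rho * A + (1 - rho) * A_hat, rho * G + (1 - rho) * G_hat

def kfac_precondition(grad_W, A, G, damping=1e-3):
    """Vanilla K-FAC preconditioning of a (d_out, d_in) gradient.
    The two explicit inversions below are the cubic-in-layer-size bottleneck
    the abstract refers to."""
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))  # O(d_in^3)
    G_inv = np.linalg.inv(G + damping * np.eye(G.shape[0]))  # O(d_out^3)
    return G_inv @ grad_W @ A_inv
```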
Numerical results show RS-KFAC's inversion error can be reduced with minimal
CPU overhead by adding our proposed update to it. Based on the proposed
procedure, a correction to it, and RS-KFAC, we propose three practical
algorithms for optimizing generic Deep Neural Nets. Numerical results show that
two of these outperform RS-KFAC for any target test accuracy on CIFAR10
classification with a slightly modified version of VGG16_bn. Our proposed
algorithms achieve 91% test accuracy faster than SENG (the state-of-the-art
implementation of empirical NG for DL; arXiv:2006.05924) but underperform it
for higher target test accuracies.
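One way the linear-in-layer-size scaling claimed above can arise is if a K-factor inverse is kept in a low-rank-plus-scaled-identity form; the sketch below shows that applying such a form costs O(d*r) per column. Both the representation and the function are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def apply_lowrank_inverse(V, U, s, lam):
    """Apply (U diag(s) U^T + lam*I)^{-1} to the columns of V in O(d*r) time
    per column, using the identity
        (U diag(s) U^T + lam*I)^{-1} = (1/lam) * (I - U diag(s/(s+lam)) U^T),
    valid when U is (d, r) with orthonormal columns and lam > 0."""
    coeff = s / (s + lam)          # (r,) shrinkage weights
    proj = U.T @ V                 # (r, k) projection onto the low-rank basis
    return (V - U @ (coeff[:, None] * proj)) / lam
```

With the rank r fixed, the cost per column is linear in the layer dimension d, compared with the quadratic cost of multiplying by a dense explicit inverse.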
Related papers
- Structured Inverse-Free Natural Gradient: Memory-Efficient & Numerically-Stable KFAC [26.275682325827706]
Second-order methods such as KFAC can be useful for neural net training.
However, they are often memory-inefficient since their preconditioning Kronecker factors are dense.
We formulate an inverse-free KFAC update and impose structures in the Kronecker factors, resulting in structured inverse-free natural gradient descent.
arXiv Detail & Related papers (2023-12-09T23:13:32Z)
- Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z)
- Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning [19.04755792575149]
We propose DP-KFAC, a novel distributed preconditioning scheme for deep neural network (DNN) training.
DP-KFAC reduces computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update.
arXiv Detail & Related papers (2022-06-30T09:22:25Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on a recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization [0.913755431537592]
We show that Kronecker-Factored, block-diagonal curvature estimates (KFAC) significantly outperform true second-order updates.
We also show that KFAC approximates a first-order gradient algorithm which performs gradient descent on neurons rather than weights.
arXiv Detail & Related papers (2022-01-28T17:06:26Z)
- Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition [0.0]
It is shown that natural gradient descent can minimize the objective function more efficiently than ordinary gradient descent based methods.
The bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration.
This has motivated various approximations of either the exact FIM or the empirical one.
The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM.
With only a slight additional cost, a few improvements of KFAC from the standpoint of accuracy are proposed.
arXiv Detail & Related papers (2022-01-25T12:56:17Z)
- Stochastic Optimization of Areas Under Precision-Recall Curves with Provable Convergence [66.83161885378192]
Area under ROC (AUROC) and precision-recall curves (AUPRC) are common metrics for evaluating classification performance for imbalanced problems.
We propose a technical method to optimize AUPRC for deep learning.
arXiv Detail & Related papers (2021-04-18T06:22:21Z)
- An iterative K-FAC algorithm for Deep Learning [0.0]
The key of K-FAC is to approximate the Fisher information matrix (FIM) as a block-diagonal matrix where each block is an inverse of tiny Kronecker factors.
In this short note, we present CG-FAC, a new iterative K-FAC algorithm.
We prove that the time and memory complexity of iterative CG-FAC is much less than that of the standard K-FAC algorithm (a matrix-free sketch of this idea is given after this list).
arXiv Detail & Related papers (2021-01-01T12:04:01Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- ScopeFlow: Dynamic Scene Scoping for Optical Flow [94.42139459221784]
We propose to modify the common training protocols of optical flow.
The improvement is based on observing the bias in sampling challenging data.
We find that both regularization and augmentation should decrease during the training protocol.
arXiv Detail & Related papers (2020-02-25T09:58:49Z)
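As referenced in the CG-FAC entry above, here is a hedged sketch of the matrix-free idea behind an iterative K-FAC: solve the Kronecker-factored block system by conjugate gradients, using only products with the small factors and never forming the Kronecker product or its inverse. The function names, damping value and stopping rule are illustrative assumptions, not the exact CG-FAC algorithm.

```python
import numpy as np

def kfac_block_matvec(X, A, G, lam):
    """Product of the Kronecker-factored block (A kron G + lam*I) with vec(X),
    returned in matrix form. A (d_in, d_in) and G (d_out, d_out) are assumed
    symmetric positive semi-definite, X is (d_out, d_in)."""
    return G @ X @ A + lam * X

def cg_solve_kfac(grad_W, A, G, lam=1e-3, tol=1e-6, max_iter=50):
    """Conjugate-gradient solve of (A kron G + lam*I) vec(X) = vec(grad_W),
    using only matrix products with the small factors."""
    X = np.zeros_like(grad_W)
    R = grad_W - kfac_block_matvec(X, A, G, lam)   # residual
    P = R.copy()
    rs_old = np.sum(R * R)                         # Frobenius inner product
    for _ in range(max_iter):
        AP = kfac_block_matvec(P, A, G, lam)
        alpha = rs_old / np.sum(P * AP)
        X += alpha * P
        R -= alpha * AP
        rs_new = np.sum(R * R)
        if np.sqrt(rs_new) < tol:
            break
        P = R + (rs_new / rs_old) * P
        rs_old = rs_new
    return X
```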