KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural
Networks
- URL: http://arxiv.org/abs/2107.01739v1
- Date: Sun, 4 Jul 2021 21:34:22 GMT
- Title: KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural
Networks
- Authors: J. Gregory Pauloski, Qi Huang, Lei Huang, Shivaram Venkataraman, Kyle
Chard, Ian Foster, Zhao Zhang
- Abstract summary: We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order framework.
We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models.
- Score: 11.340789769829069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to
converge faster in deep neural network (DNN) training than stochastic gradient
descent (SGD); however, K-FAC's larger memory footprint hinders its
applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable,
Improved, and ScAlable second-order optimizer framework that adapts the memory
footprint, communication, and computation given specific models and hardware to
maximize performance and enhance scalability. We quantify the
tradeoffs between memory and communication cost and evaluate KAISA on large
models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA
A100 GPUs. Compared to the original optimizers, KAISA converges 18.1-36.3%
faster across applications with the same global batch size. Under a fixed
memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and
BERT-Large, respectively. KAISA can balance memory and communication to achieve
scaling efficiency equal to or better than the baseline optimizers.
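
To make the memory footprint concrete, below is a minimal sketch of a standard K-FAC preconditioning step for one fully connected layer (following the usual Kronecker-factored formulation, not KAISA's actual implementation or API); the layer sizes, variable names, and damping constant are illustrative assumptions. Each layer keeps two Kronecker factors, A (input-activation covariance, d_in x d_in) and G (output-gradient covariance, d_out x d_out); storing and inverting these factors is the extra memory and computation that distinguishes K-FAC from SGD.

```python
# Illustrative K-FAC-style update for one fully connected layer (NumPy only).
# Not KAISA's implementation; shapes, names, and damping are made up for this sketch.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 32, 512, 256

a = rng.standard_normal((batch, d_in))    # layer inputs (activations)
g = rng.standard_normal((batch, d_out))   # gradients w.r.t. the layer's outputs
grad_W = g.T @ a / batch                  # ordinary first-order gradient of W (d_out x d_in)

# Kronecker factors: the Fisher block for this layer is approximated as A ⊗ G.
# Their d_in^2 and d_out^2 entries are the per-layer memory overhead of K-FAC.
A = a.T @ a / batch
G = g.T @ g / batch

# Damped inverses keep the factors invertible and well-conditioned.
damping = 1e-3
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

# (A ⊗ G)^(-1) vec(grad_W) is equivalent to G^(-1) @ grad_W @ A^(-1).
precond_grad = G_inv @ grad_W @ A_inv

# Take an SGD-style step along the preconditioned direction.
lr = 0.01
W = rng.standard_normal((d_out, d_in))
W -= lr * precond_grad
```

In a distributed setting, where each layer's factors live, which workers invert them, and which results are broadcast determines how much memory each GPU holds versus how much data moves over the network; that is the memory/communication tradeoff the abstract above quantifies and that KAISA adapts to the model and hardware at hand.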
Related papers
- Adapprox: Adaptive Approximation in Adam Optimization via Randomized Low-Rank Matrices [24.319712013824876]
Adapprox is a novel approach that employs randomized low-rank matrix approximation for a more accurate approximation of Adam's second moment.
In GPT-2 training and downstream tasks, Adapprox surpasses AdamW by achieving 34.5% to 49.9% memory savings.
arXiv Detail & Related papers (2024-03-22T05:23:31Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - CAME: Confidence-guided Adaptive Memory Efficient Optimization [20.009302737137787]
Adaptive gradient methods have demonstrated excellent performance in the training of large language models.
Maintaining second-moment estimates, however, incurs a high cost in extra memory overhead.
Several memory-efficient optimizers have been proposed to drastically reduce auxiliary memory usage, but they come with a performance penalty.
We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.
arXiv Detail & Related papers (2023-07-05T06:05:36Z) - More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using
Sparsity [103.62784587778037]
Recently, a couple of advanced convolutional models have struck back with large kernels, motivated by the local but large attention mechanism.
We propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with 51x51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers.
arXiv Detail & Related papers (2022-07-07T23:55:52Z) - Scalable K-FAC Training for Deep Neural Networks with Distributed
Preconditioning [19.04755792575149]
We propose DP-KFAC, a novel distributed preconditioning scheme for deep neural network (DNN) training.
DP-KFAC reduces computation overhead by 1.55x-1.65x, the communication cost by 2.79x-3.15x, and the memory footprint by 1.14x-1.47x in each second-order update.
arXiv Detail & Related papers (2022-06-30T09:22:25Z) - HANT: Hardware-Aware Network Transformation [82.54824188745887]
We propose hardware-aware network transformation (HANT).
HANT replaces inefficient operations with more efficient alternatives using an approach similar to neural architecture search.
Our results on accelerating the EfficientNet family show that HANT can accelerate these models by up to 3.6x with a 0.4% drop in top-1 accuracy on the ImageNet dataset.
arXiv Detail & Related papers (2021-07-12T18:46:34Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z) - Optimizing Memory Efficiency of Graph Neural Networks on Edge Computing
Platforms [10.045922468883486]
Graph neural networks (GNN) have achieved state-of-the-art performance on various industrial tasks.
A feature decomposition approach is proposed to optimize the memory efficiency of GNN inference.
The proposed approach yields substantial memory savings on various GNN models covering a wide range of datasets and speeds up inference by up to 3x.
arXiv Detail & Related papers (2021-04-07T11:15:12Z) - Optimizing Memory Placement using Evolutionary Graph Reinforcement
Learning [56.83172249278467]
We introduce Evolutionary Graph Reinforcement Learning (EGRL), a method designed for large search spaces.
We train and validate our approach directly on the Intel NNP-I chip for inference.
We additionally achieve 28-78% speed-up compared to the native NNP-I compiler on all three workloads.
arXiv Detail & Related papers (2020-07-14T18:50:12Z) - Convolutional Neural Network Training with Distributed K-FAC [14.2773046188145]
Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an approximation of the Fisher Information Matrix.
We investigate here a scalable K-FAC design and its applicability in convolutional neural network (CNN) training at scale.
arXiv Detail & Related papers (2020-07-01T22:00:53Z) - FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining [65.39532971991778]
We present an accuracy predictor that scores architecture and training recipes jointly, guiding both sample selection and ranking.
We run fast evolutionary searches in just CPU minutes to generate architecture-recipe pairs for a variety of resource constraints.
FBNetV3 comprises a family of state-of-the-art compact neural networks that outperform both automatically and manually designed competitors.
arXiv Detail & Related papers (2020-06-03T05:20:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.