DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training
- URL: http://arxiv.org/abs/2310.02025v4
- Date: Fri, 15 Mar 2024 15:28:11 GMT
- Title: DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training
- Authors: Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, Sijia Liu,
- Abstract summary: Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems.
No prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance.
We develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch.
- Score: 33.11416096294998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zeroth-order (ZO) optimization has become a popular technique for solving machine learning (ML) problems when first-order (FO) information is difficult or impossible to obtain. However, the scalability of ZO optimization remains an open problem: Its use has primarily been limited to relatively small-scale ML problems, such as sample-wise adversarial attack generation. To our best knowledge, no prior work has demonstrated the effectiveness of ZO optimization in training deep neural networks (DNNs) without a significant decrease in performance. To overcome this roadblock, we develop DeepZero, a principled ZO deep learning (DL) framework that can scale ZO optimization to DNN training from scratch through three primary innovations. First, we demonstrate the advantages of coordinatewise gradient estimation (CGE) over randomized vector-wise gradient estimation in training accuracy and computational efficiency. Second, we propose a sparsityinduced ZO training protocol that extends the model pruning methodology using only finite differences to explore and exploit the sparse DL prior in CGE. Third, we develop the methods of feature reuse and forward parallelization to advance the practical implementations of ZO training. Our extensive experiments show that DeepZero achieves state-of-the-art (SOTA) accuracy on ResNet-20 trained on CIFAR-10, approaching FO training performance for the first time. Furthermore, we show the practical utility of DeepZero in applications of certified adversarial defense and DL-based partial differential equation error correction, achieving 10-20% improvement over SOTA. We believe our results will inspire future research on scalable ZO optimization and contribute to advancing DL with black box. Codes are available at https://github.com/OPTML-Group/DeepZero.
Related papers
- Dynamic Sparse Training via Balancing the Exploration-Exploitation
Trade-off [19.230329532065635]
Sparse training could significantly mitigate the training costs by reducing the model size.
Existing sparse training methods mainly use either random-based or greedy-based drop-and-grow strategies.
In this work, we consider the dynamic sparse training as a sparse connectivity search problem.
Experimental results show that sparse models (up to 98% sparsity) obtained by our proposed method outperform the SOTA sparse training methods.
arXiv Detail & Related papers (2022-11-30T01:22:25Z) - Towards Theoretically Inspired Neural Initialization Optimization [66.04735385415427]
We propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network.
We show that both the training and test performance of a network can be improved by maximizing GradCosine under norm constraint.
Generalized from the sample-wise analysis into the real batch setting, NIO is able to automatically look for a better initialization with negligible cost.
arXiv Detail & Related papers (2022-10-12T06:49:16Z) - Two Heads are Better than One: Robust Learning Meets Multi-branch Models [14.72099568017039]
We propose Branch Orthogonality adveRsarial Training (BORT) to obtain state-of-the-art performance with solely the original dataset for adversarial training.
We evaluate our approach on CIFAR-10, CIFAR-100, and SVHN against ell_infty norm-bounded perturbations of size epsilon = 8/255, respectively.
arXiv Detail & Related papers (2022-08-17T05:42:59Z) - Controlled Sparsity via Constrained Optimization or: How I Learned to
Stop Tuning Penalties and Love Constraints [81.46143788046892]
We focus on the task of controlling the level of sparsity when performing sparse learning.
Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor.
We propose a constrained formulation where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion.
arXiv Detail & Related papers (2022-08-08T21:24:20Z) - Distributed Adversarial Training to Robustify Deep Neural Networks at
Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach, known as adversarial training (AT), has been shown to mitigate robust training.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z) - A Simple Fine-tuning Is All You Need: Towards Robust Deep Learning Via
Adversarial Fine-tuning [90.44219200633286]
We propose a simple yet very effective adversarial fine-tuning approach based on a $textitslow start, fast decay$ learning rate scheduling strategy.
Experimental results show that the proposed adversarial fine-tuning approach outperforms the state-of-the-art methods on CIFAR-10, CIFAR-100 and ImageNet datasets.
arXiv Detail & Related papers (2020-12-25T20:50:15Z) - Training Sparse Neural Networks using Compressed Sensing [13.84396596420605]
We develop and test a novel method based on compressed sensing which combines the pruning and training into a single step.
Specifically, we utilize an adaptively weighted $ell1$ penalty on the weights during training, which we combine with a generalization of the regularized dual averaging (RDA) algorithm in order to train sparse neural networks.
arXiv Detail & Related papers (2020-08-21T19:35:54Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Passive Batch Injection Training Technique: Boosting Network Performance
by Injecting Mini-Batches from a different Data Distribution [39.8046809855363]
This work presents a novel training technique for deep neural networks that makes use of additional data from a distribution that is different from that of the original input data.
To the best of our knowledge, this is the first work that makes use of different data distribution to aid the training of convolutional neural networks (CNNs)
arXiv Detail & Related papers (2020-06-08T08:17:32Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.