Symbolic Discovery of Optimization Algorithms
- URL: http://arxiv.org/abs/2302.06675v4
- Date: Mon, 8 May 2023 21:49:57 GMT
- Title: Symbolic Discovery of Optimization Algorithms
- Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao
Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V.
Le
- Abstract summary: We use efficient search techniques to explore an infinite and sparse program space.
Our method discovers a simple and effective optimization algorithm, $\textbf{Lion}$.
Lion is successfully deployed in production systems such as the Google search ads CTR model.
- Score: 132.62397077095787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method to formulate algorithm discovery as program search, and
apply it to discover optimization algorithms for deep neural network training.
We leverage efficient search techniques to explore an infinite and sparse
program space. To bridge the large generalization gap between proxy and target
tasks, we also introduce program selection and simplification strategies. Our
method discovers a simple and effective optimization algorithm, $\textbf{Lion}$
($\textit{Evo$\textbf{L}$ved S$\textbf{i}$gn M$\textbf{o}$me$\textbf{n}$tum}$).
It is more memory-efficient than Adam as it only keeps track of the momentum.
Different from adaptive optimizers, its update has the same magnitude for each
parameter calculated through the sign operation. We compare Lion with widely
used optimizers, such as Adam and Adafactor, for training a variety of models
on different tasks. On image classification, Lion boosts the accuracy of ViT by
up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On
vision-language contrastive learning, we achieve 88.3% $\textit{zero-shot}$ and
91.1% $\textit{fine-tuning}$ accuracy on ImageNet, surpassing the previous best
results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms
Adam by achieving a better FID score and reducing the training compute by up to
2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion
exhibits a similar or better performance compared to Adam. Our analysis of Lion
reveals that its performance gain grows with the training batch size. It also
requires a smaller learning rate than Adam due to the larger norm of the update
produced by the sign function. Additionally, we examine the limitations of Lion
and identify scenarios where its improvements are small or not statistically
significant. Lion is also successfully deployed in production systems such as
the Google search ads CTR model.
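For readers who want the update rule in concrete form, the sign-based update with a single momentum buffer described above can be sketched as follows. This is a minimal NumPy illustration pieced together from the abstract's description, not the authors' reference implementation, and the hyperparameter defaults (lr, beta1, beta2, weight_decay) are indicative only.

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single parameter tensor (illustrative sketch)."""
    # Interpolate between the momentum buffer and the current gradient, then take
    # the sign: every coordinate of the update has the same magnitude.
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Decoupled weight decay, in the style of AdamW; set weight_decay=0.0 to disable.
    update = update + weight_decay * param
    new_param = param - lr * update
    # Only this single momentum buffer is stored, versus two moment estimates in Adam.
    new_m = beta2 * m + (1.0 - beta2) * grad
    return new_param, new_m
```

Because the sign gives every coordinate a unit-magnitude step, the overall update norm is larger than Adam's, which is why a smaller learning rate is needed; the paper pairs this with a correspondingly larger decoupled weight decay so that the effective regularization (learning rate times decay) stays comparable.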
Related papers
- Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction [62.8375542401319]
Multimodal Large Language Models (MLLMs) encode the input image(s) as vision tokens and feed them into the language backbone.
The number of vision tokens increases quadratically with the image resolution, leading to huge computational costs.
We propose a greedy search algorithm (G-Search) to find the smallest number of vision tokens to keep at each layer, from the shallow layers to the deep ones.
arXiv Detail & Related papers (2024-11-30T18:54:32Z) - MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to the development of this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing.
Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
arXiv Detail & Related papers (2024-09-03T12:03:45Z) - Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs [18.242110417706]
This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model.
We show the optimality of this approach for fine-tuning tasks under certain conditions.
Our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour.
arXiv Detail & Related papers (2024-05-05T00:08:00Z) - Neural Optimizer Equation, Decay Function, and Learning Rate Schedule Joint Evolution [0.0]
A major contributor to the quality of a deep learning model is the selection of the optimizer.
We propose a new dual-joint search space in the realm of neural optimizer search (NOS), along with an integrity check, to automate the process of finding deep learning optimizers.
We find multiple optimizers, learning rate schedules, and Adam variants that outperform Adam, as well as other standard deep learning optimizers, across image classification tasks.
arXiv Detail & Related papers (2024-04-10T02:00:24Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For instance, scaling laws mostly predict next-token loss, while models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - Lion Secretly Solves Constrained Optimization: As Lyapunov Predicts [8.393403749426097]
Lion (Evolved Sign Momentum) has shown promising results in training large AI models.
It performs comparably or favorably to AdamW but with greater memory efficiency.
Our analysis is made possible by the development of a new Lyapunov function for the Lion updates; a sketch of the claimed constrained formulation appears after this list.
arXiv Detail & Related papers (2023-10-09T17:41:29Z) - ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that with 30% of vision tokens removed across 12 ViT layers, ELIP maintains comparable performance.
arXiv Detail & Related papers (2023-09-28T05:31:07Z) - Differentially Private Image Classification from Features [53.75086935617644]
Leveraging transfer learning has been shown to be an effective strategy for training large models with Differential Privacy.
Recent works have found that privately training just the last layer of a pre-trained model provides the best utility with DP.
arXiv Detail & Related papers (2022-11-24T04:04:20Z) - A contextual analysis of multi-layer perceptron models in classifying
hand-written digits and letters: limited resources [0.0]
We extensively test an end-to-end vanilla neural network (MLP) approach in pure NumPy, without any pre-processing or feature extraction done beforehand.
We show that basic data mining operations can significantly improve the performance of the models in terms of computational time.
arXiv Detail & Related papers (2021-07-05T04:30:37Z) - TAdam: A Robust Stochastic Gradient Optimizer [6.973803123972298]
Machine learning algorithms aim to find patterns from observations, which may include some noise, especially in the robotics domain.
To perform well even with such noise, we expect them to be able to detect outliers and discard them when needed.
We propose a new stochastic gradient optimization method whose robustness is built directly into the algorithm, using the robust Student's t-distribution as its core idea.
arXiv Detail & Related papers (2020-02-29T04:32:36Z)
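As a side note on the Lyapunov analysis cited in the "Lion Secretly Solves Constrained Optimization" entry above: that paper's framing suggests reading Lion with decoupled weight decay $\lambda > 0$ as implicitly solving the bound-constrained problem

$$\min_{x} f(x) \quad \text{subject to} \quad \|x\|_{\infty} \le \tfrac{1}{\lambda},$$

with the new Lyapunov function certifying descent of the Lion iterates toward a solution of this problem. This is a paraphrase of the claimed result for orientation, not a statement taken verbatim from that paper.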
This list is automatically generated from the titles and abstracts of the papers on this site.