Zero-th Order Algorithm for Softmax Attention Optimization
- URL: http://arxiv.org/abs/2307.08352v1
- Date: Mon, 17 Jul 2023 09:43:50 GMT
- Title: Zero-th Order Algorithm for Softmax Attention Optimization
- Authors: Yichuan Deng, Zhihang Li, Sridhar Mahadevan, Zhao Song
- Abstract summary: We present a Zero-th Order algorithm specifically tailored for Softmax optimization.
We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale language models.
- Score: 21.631643446337737
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have brought about significant transformations
in human society. Among the crucial computations in LLMs, the softmax unit
holds great importance. It helps the model generate a probability
distribution over potential subsequent words or phrases, given a series of
input words. By utilizing this distribution, the model selects the most
probable next word or phrase, based on the assigned probabilities. The softmax
unit assumes a vital function in LLM training as it facilitates learning from
data through the adjustment of neural network weights and biases.
As LLMs grow in size, computing the gradient becomes
expensive. However, zeroth-order methods can approximately compute the gradient
with only forward passes. In this paper, we present a Zero-th Order algorithm
specifically tailored for Softmax optimization. We demonstrate the convergence
of our algorithm, highlighting its effectiveness in efficiently computing
gradients for large-scale LLMs. By leveraging the Zeroth-Order method, our work
contributes to the advancement of optimization techniques in the context of
complex language models.
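The core idea the abstract refers to — estimating a gradient using only forward passes — can be illustrated with a two-point finite-difference estimator. The sketch below is a generic illustration, not the paper's algorithm: it uses a softmax-regression-style test loss (minimizing the distance between the normalized exp(Ax) and a target distribution b, a formulation common in this line of work), and the step sizes, sample counts, and sampling scheme are illustrative assumptions.

```python
import numpy as np

def softmax_regression_loss(x, A, b):
    """L(x) = || exp(Ax) / <exp(Ax), 1> - b ||_2.
    A softmax-regression-style objective, used here only as a test function."""
    z = A @ x
    u = np.exp(z - z.max())          # shift for numerical stability
    return np.linalg.norm(u / u.sum() - b)

def zo_gradient(loss, x, delta=1e-4, num_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate: average
    (L(x + delta*u) - L(x - delta*u)) / (2*delta) * u over random directions u.
    Only forward evaluations of `loss` are needed -- no backpropagation."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape[0])
        g += (loss(x + delta * u) - loss(x - delta * u)) / (2 * delta) * u
    return g / num_samples

# Toy usage: plain zeroth-order gradient descent on a random instance.
rng = np.random.default_rng(0)
n, d = 32, 8
A = rng.standard_normal((n, d))
b = rng.random(n); b /= b.sum()      # target probability distribution
x = np.zeros(d)
print("initial loss:", softmax_regression_loss(x, A, b))
for step in range(200):
    x -= 0.5 * zo_gradient(lambda z: softmax_regression_loss(z, A, b), x, rng=rng)
print("final loss:  ", softmax_regression_loss(x, A, b))
```

In expectation this estimator approximates the gradient of a smoothed version of the loss, and its variance grows with the dimension, which is why a convergence analysis tailored to the softmax objective, as the paper provides, is of interest.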
Related papers
- SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches (a generic sketch of the subspace-perturbation idea appears after this list).
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Algorithmic Language Models with Neurally Compiled Libraries [16.284360949127723]
Large Language Models lack true algorithmic ability.
Our paper proposes augmenting LLMs with a library of fundamental operations and sophisticated differentiable programs.
We explore the feasibility of augmenting LLaMA3 with a differentiable computer.
arXiv Detail & Related papers (2024-07-06T00:27:05Z)
- Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- Large Language Models As Evolution Strategies [6.873777465945062]
In this work, we investigate whether large language models (LLMs) are in principle capable of implementing evolutionary optimization algorithms.
We introduce a novel prompting strategy, consisting of least-to-most sorting of discretized population members.
We find that our setup allows the user to obtain an LLM-based evolution strategy, which we call EvoLLM, that robustly outperforms baseline algorithms.
arXiv Detail & Related papers (2024-02-28T15:02:17Z)
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs.
arXiv Detail & Related papers (2023-10-02T02:01:16Z)
- Attention Scheme Inspired Softmax Regression [20.825033982038455]
Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit.
In this work, inspired by the softmax unit, we define a softmax regression problem.
arXiv Detail & Related papers (2023-04-20T15:50:35Z)
- Optimizing the optimizer for data driven deep neural networks and physics informed neural networks [2.54325834280441]
We investigate the role of the optimization method in determining the quality of the model fit for neural networks with a small to medium number of parameters.
We find that the LM algorithm is able to rapidly converge to machine precision, offering significant benefits over other algorithms.
arXiv Detail & Related papers (2022-05-16T02:42:22Z)
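As referenced in the SubZero entry above, one way to reduce the cost of zeroth-order fine-tuning is to restrict perturbations to a random low-dimensional subspace. The sketch below is a generic illustration of that idea under assumed dimensions, step sizes, and learning rate; it is not SubZero's actual algorithm, and `subspace_zo_step` is a hypothetical helper introduced here only for illustration.

```python
import numpy as np

def subspace_zo_step(loss, x, rank=16, delta=1e-3, lr=0.1, rng=None):
    """One generic random-subspace zeroth-order update (illustrative only):
    draw a random d x rank projection P, estimate the gradient of the
    rank-dimensional function z -> loss(x + P z) at z = 0 with two-point
    finite differences, and map the estimate back through P."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    P = rng.standard_normal((d, rank)) / np.sqrt(rank)   # random subspace basis
    g_low = np.zeros(rank)
    for i in range(rank):
        direction = P[:, i]
        g_low[i] = (loss(x + delta * direction) - loss(x - delta * direction)) / (2 * delta)
    return x - lr * (P @ g_low)    # update stays inside the sampled subspace
```

Each step needs only `rank` pairs of forward passes, and the projection can be regenerated from a stored random seed rather than kept in memory, which is how methods of this kind typically keep their memory footprint small.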
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.