Zero-th Order Algorithm for Softmax Attention Optimization
- URL: http://arxiv.org/abs/2307.08352v1
- Date: Mon, 17 Jul 2023 09:43:50 GMT
- Title: Zero-th Order Algorithm for Softmax Attention Optimization
- Authors: Yichuan Deng, Zhihang Li, Sridhar Mahadevan, Zhao Song
- Abstract summary: We present a Zero-th Order algorithm specifically tailored for Softmax optimization.
We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale language models.
- Score: 21.631643446337737
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have brought about significant transformations
in human society. Among the crucial computations in LLMs, the softmax unit
holds great importance. It helps the model generate a probability
distribution over potential subsequent words or phrases, given a series of
input words. By utilizing this distribution, the model selects the most
probable next word or phrase, based on the assigned probabilities. The softmax
unit assumes a vital function in LLM training as it facilitates learning from
data through the adjustment of neural network weights and biases.
As LLMs grow in size, computing the gradient becomes
expensive. However, zeroth-order methods can approximately compute the gradient
with only forward passes. In this paper, we present a Zero-th Order algorithm
specifically tailored for Softmax optimization. We demonstrate the convergence
of our algorithm, highlighting its effectiveness in efficiently computing
gradients for large-scale LLMs. By leveraging the Zeroth-Order method, our work
contributes to the advancement of optimization techniques in the context of
complex language models.
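The core idea the abstract refers to — estimating a gradient using only forward passes — can be illustrated with a two-point finite-difference estimator. The sketch below is a generic illustration, not the paper's algorithm: it uses a softmax-regression-style test loss (minimizing the distance between the normalized exp(Ax) and a target distribution b, a formulation common in this line of work), and the step sizes, sample counts, and sampling scheme are illustrative assumptions.

```python
import numpy as np

def softmax_regression_loss(x, A, b):
    """L(x) = || exp(Ax) / <exp(Ax), 1> - b ||_2.
    A softmax-regression-style objective, used here only as a test function."""
    z = A @ x
    u = np.exp(z - z.max())          # shift for numerical stability
    return np.linalg.norm(u / u.sum() - b)

def zo_gradient(loss, x, delta=1e-4, num_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate: average
    (L(x + delta*u) - L(x - delta*u)) / (2*delta) * u over random directions u.
    Only forward evaluations of `loss` are needed -- no backpropagation."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(x)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape[0])
        g += (loss(x + delta * u) - loss(x - delta * u)) / (2 * delta) * u
    return g / num_samples

# Toy usage: plain zeroth-order gradient descent on a random instance.
rng = np.random.default_rng(0)
n, d = 32, 8
A = rng.standard_normal((n, d))
b = rng.random(n); b /= b.sum()      # target probability distribution
x = np.zeros(d)
print("initial loss:", softmax_regression_loss(x, A, b))
for step in range(200):
    x -= 0.5 * zo_gradient(lambda z: softmax_regression_loss(z, A, b), x, rng=rng)
print("final loss:  ", softmax_regression_loss(x, A, b))
```

In expectation this estimator approximates the gradient of a smoothed version of the loss, and its variance grows with the dimension, which is why a convergence analysis tailored to the softmax objective, as the paper provides, is of interest.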
Related papers
- SubZero: Random Subspace Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and converges faster than standard ZO approaches (a generic sketch of the subspace-perturbation idea appears after this list).
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Algorithmic Language Models with Neurally Compiled Libraries [16.284360949127723]
Large Language Models lack true algorithmic ability.
Our paper proposes augmenting LLMs with a library of fundamental operations and sophisticated differentiable programs.
We explore the feasibility of augmenting LLaMA3 with a differentiable computer.
arXiv Detail & Related papers (2024-07-06T00:27:05Z)
- Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
Offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- Large Language Models As Evolution Strategies [6.873777465945062]
In this work, we investigate whether large language models (LLMs) are in principle capable of implementing evolutionary optimization algorithms.
We introduce a novel prompting strategy, consisting of least-to-most sorting of discretized population members.
We find that our setup allows the user to obtain an LLM-based evolution strategy, which we call EvoLLM, that robustly outperforms baseline algorithms.
arXiv Detail & Related papers (2024-02-28T15:02:17Z)
- Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs.
arXiv Detail & Related papers (2023-10-02T02:01:16Z)
- Attention Scheme Inspired Softmax Regression [20.825033982038455]
Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit.
In this work, inspired by the softmax unit, we define a softmax regression problem.
arXiv Detail & Related papers (2023-04-20T15:50:35Z)
- Optimizing the optimizer for data driven deep neural networks and physics informed neural networks [2.54325834280441]
We investigate the role of the optimization method in determining the quality of the model fit for neural networks with a small to medium number of parameters.
We find that the LM algorithm is able to rapidly converge to machine precision, offering significant benefits over other algorithms.
arXiv Detail & Related papers (2022-05-16T02:42:22Z)
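As referenced in the SubZero entry above, one way to reduce the cost of zeroth-order fine-tuning is to restrict perturbations to a random low-dimensional subspace. The sketch below is a generic illustration of that idea under assumed dimensions, step sizes, and learning rate; it is not SubZero's actual algorithm, and `subspace_zo_step` is a hypothetical helper introduced here only for illustration.

```python
import numpy as np

def subspace_zo_step(loss, x, rank=16, delta=1e-3, lr=0.1, rng=None):
    """One generic random-subspace zeroth-order update (illustrative only):
    draw a random d x rank projection P, estimate the gradient of the
    rank-dimensional function z -> loss(x + P z) at z = 0 with two-point
    finite differences, and map the estimate back through P."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    P = rng.standard_normal((d, rank)) / np.sqrt(rank)   # random subspace basis
    g_low = np.zeros(rank)
    for i in range(rank):
        direction = P[:, i]
        g_low[i] = (loss(x + delta * direction) - loss(x - delta * direction)) / (2 * delta)
    return x - lr * (P @ g_low)    # update stays inside the sampled subspace
```

Each step needs only `rank` pairs of forward passes, and the projection can be regenerated from a stored random seed rather than kept in memory, which is how methods of this kind typically keep their memory footprint small.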
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.