The Fine-Grained Complexity of Gradient Computation for Training Large
Language Models
- URL: http://arxiv.org/abs/2402.04497v1
- Date: Wed, 7 Feb 2024 00:45:31 GMT
- Title: The Fine-Grained Complexity of Gradient Computation for Training Large
Language Models
- Authors: Josh Alman, Zhao Song
- Abstract summary: Large language models (LLMs) have made fundamental contributions over the last a few years.
We show nearly identical results for the harder-seeming problem of computing the gradient of loss function of one layer attention network.
- Score: 12.853829771559916
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have made fundamental contributions over the
last a few years. To train an LLM, one needs to alternatingly run `forward'
computations and `backward' computations. The forward computation can be viewed
as attention function evaluation, and the backward computation can be viewed as
a gradient computation. In previous work by [Alman and Song, NeurIPS 2023], it
was proved that the forward step can be performed in almost-linear time in
certain parameter regimes, but that there is no truly sub-quadratic time
algorithm in the remaining parameter regimes unless the popular hypothesis SETH
is false. In this work, we show nearly identical results for the harder-seeming
problem of computing the gradient of loss function of one layer attention
network, and thus for the entire process of LLM training. This completely
characterizes the fine-grained complexity of every step of LLM training.
Related papers
- Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling [44.31966204357333]
We develop memory-efficient algorithms for large-scale machine learning problems.
We use two key techniques to make our approach memory-efficient and avoid full computations.
arXiv Detail & Related papers (2025-02-20T15:37:45Z) - IGC: Integrating a Gated Calculator into an LLM to Solve Arithmetic Tasks Reliably and Efficiently [17.525220958618988]
We introduce the Integrated Gated Calculator (IGC), a module that enables Large Language Models to perform arithmetic by emulating a calculator on the GPU.
We finetune a Llama model with our module and test it on the BigBench Arithmetic benchmark, where it beats the State of the Art.
Our approach takes only a single iteration to run and requires no external tools.
arXiv Detail & Related papers (2025-01-01T00:01:27Z) - Efficiently Scaling LLM Reasoning with Certaindex [25.549811985276488]
Test-time reasoning algorithms can wastefully generate many tokens without improving accuracy.<n>We introduce Certaindex, an algorithm-agnostic metric measuring when further computation is unlikely to alter the final result.<n>Certaindex is lightweight, can accelerate reasoning program inference via early exit, and enables dynamic token allocation.
arXiv Detail & Related papers (2024-12-30T14:57:53Z) - Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks, FFN or self-attention layers.
Our findings validate that significant amount of computations can be avoided at inference time.
arXiv Detail & Related papers (2024-10-12T09:21:45Z) - Interpreting and Improving Large Language Models in Arithmetic Calculation [72.19753146621429]
Large language models (LLMs) have demonstrated remarkable potential across numerous applications.
In this work, we delve into uncovering a specific mechanism by which LLMs execute calculations.
We investigate the potential benefits of selectively fine-tuning these essential heads/MLPs to boost the LLMs' computational performance.
arXiv Detail & Related papers (2024-09-03T07:01:46Z) - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z) - Temporal Scaling Law for Large Language Models [57.83580734589091]
We propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up.<n>In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position.<n>We derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law.
arXiv Detail & Related papers (2024-04-27T05:49:11Z) - Reverse That Number! Decoding Order Matters in Arithmetic Learning [49.5504492920404]
Our work introduces a novel strategy that reevaluates the digit order by prioritizing output from the least significant digit.
Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement of in accuracy while requiring only a third of the tokens typically used during training.
arXiv Detail & Related papers (2024-03-09T09:04:53Z) - Quantum Many-Body Physics Calculations with Large Language Models [7.679615503214482]
Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains.
We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method.
We design multi-step prompt templates that break down the analytic calculation into standardized steps.
We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade.
arXiv Detail & Related papers (2024-03-05T17:47:22Z) - Provably Efficient Representation Learning with Tractable Planning in
Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov Decision Processes (POMDPs)
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU)
We then show how to adapt this algorithm to also work in the broader class of $gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z) - Acceleration of Subspace Learning Machine via Particle Swarm
Optimization and Parallel Processing [23.33955958124822]
Subspace learning machine (SLM) has been proposed to offer higher performance in general classification and regression tasks.
Performance improvement is reached at the expense of higher computational complexity.
Experimental results show that the accelerated SLM method achieves a speed up factor of 577 in training time.
arXiv Detail & Related papers (2022-08-15T06:33:15Z) - Optimization-based Block Coordinate Gradient Coding for Mitigating
Partial Stragglers in Distributed Learning [58.91954425047425]
This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning.
We propose a gradient coordinate coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes.
arXiv Detail & Related papers (2022-06-06T09:25:40Z) - Meta-Learning with Adjoint Methods [16.753336086160598]
A Meta-Learning (MAML) is widely used to find a good initialization for a family of tasks.
Despite its success, a critical challenge in MAML is to calculate the gradient w.r.t the initialization of a long training trajectory for the sampled tasks.
We propose Adjoint MAML (A-MAML) to address this problem.
We demonstrate the advantage of our approach in both synthetic and real-world meta-learning tasks.
arXiv Detail & Related papers (2021-10-16T01:18:50Z) - Efficient time stepping for numerical integration using reinforcement
learning [0.15393457051344295]
We propose a data-driven time stepping scheme based on machine learning and meta-learning.
First, one or several (in the case of non-smooth or hybrid systems) base learners are trained using RL.
Then, a meta-learner is trained which (depending on the system state) selects the base learner that appears to be optimal for the current situation.
arXiv Detail & Related papers (2021-04-08T07:24:54Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $mathcalF$ characterizes the complexity of the function.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.