Related papers: Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

URL: http://arxiv.org/abs/2502.07752v2
Date: Thu, 20 Feb 2025 18:48:58 GMT
Title: Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension
Authors: Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds
Abstract summary: This paper makes a step towards the systematic design of efficient optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. We propose two design recommendations for practical efficient optimizers for LLMs, involving careful selection of structural assumptions to balance generality and efficiency.
Score: 16.037614012166063
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Designing efficient optimizers for large language models (LLMs) with low-memory requirements and fast convergence is an important and challenging problem. This paper makes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations of practical efficient optimizers for LLMs, involving the careful selection of structural assumptions to balance generality and efficiency, and enhancing memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate the effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.
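The idea of approximating the FIM under a structural assumption can be illustrated with the simplest structure, a diagonal matrix. The following is a minimal sketch, not the paper's method: it assumes the FIM is estimated empirically as the average of gradient outer products, and it uses random vectors in place of real gradients. The function names are hypothetical. Under the Frobenius norm, the best diagonal approximation keeps only the diagonal of the empirical FIM, which is the mean of squared gradients, the kind of second-moment statistic that Adam-style optimizers track.

```python
import random


def empirical_fisher(grads):
    """Average of outer products g g^T over the sampled gradients."""
    n, d = len(grads), len(grads[0])
    return [[sum(g[i] * g[j] for g in grads) / n for j in range(d)]
            for i in range(d)]


def best_diagonal(grads):
    """Mean of squared gradients per coordinate (diagonal of the empirical FIM)."""
    n, d = len(grads), len(grads[0])
    return [sum(g[i] ** 2 for g in grads) / n for i in range(d)]


def frobenius_error_diag(fisher, diag):
    """Squared Frobenius distance between `fisher` and Diag(diag)."""
    d = len(fisher)
    return sum(
        (fisher[i][j] - (diag[i] if i == j else 0.0)) ** 2
        for i in range(d)
        for j in range(d)
    )


# Assumption: random Gaussian vectors stand in for per-sample gradients.
random.seed(0)
grads = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]

fisher = empirical_fisher(grads)
diag = best_diagonal(grads)
base_error = frobenius_error_diag(fisher, diag)

# Any shift of the diagonal entries moves away from the minimizer.
perturbed = [v + 0.1 for v in diag]
perturbed_error = frobenius_error_diag(fisher, perturbed)
```

Here `perturbed_error` exceeds `base_error` because the diagonal entries already match the empirical FIM, so only the off-diagonal mass remains as error. Richer structures, such as the row and column scaling named in the abstract, would change the constraint set but follow the same recipe.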

Related papers

Relation-Aware Bayesian Optimization of DBMS Configurations Guided by Affinity Scores [2.474203056060563]
Database Management Systems (DBMSs) are fundamental for managing large-scale and heterogeneous data, and their performance is critically influenced by configuration parameters. Recent research has focused on automated configuration optimization using machine learning; however, existing approaches still exhibit several key limitations. We propose RelTune, a novel framework that represents parameter dependencies as a graph and learns GNN-based latent embeddings that encode performance-relevant semantics.
arXiv Detail & Related papers (2025-10-31T03:46:42Z)
IAM: Efficient Inference through Attention Mapping between Different-scale LLMs [74.81417160018856]
The IAM framework achieves dual benefits of accelerated attention computation and reduced KV cache usage. We show that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance.
arXiv Detail & Related papers (2025-07-16T06:39:11Z)
ESSA: Evolutionary Strategies for Scalable Alignment [2.589791058467358]
This paper introduces ESSA, a new framework that uses Evolutionary Strategies (ES) to efficiently align Large Language Models (LLMs). ES is well-suited for LLM alignment due to its favorable properties, such as high parallelizability, memory efficiency, robustness to sparse rewards, and fewer data samples required for convergence. Our findings establish ES as a promising and scalable alternative to gradient-based alignment, paving the way for efficient post-training of large language models.
arXiv Detail & Related papers (2025-07-06T16:23:07Z)
Make Optimization Once and for All with Fine-grained Guidance [78.14885351827232]
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks. L2O paradigms achieve great outcomes, e.g., refitting, generating unseen solutions iteratively or directly. Our analyses explore a general framework for learning optimization, called Diff-L2O, focusing on augmenting solutions from a wider view.
arXiv Detail & Related papers (2025-03-14T14:48:12Z)
COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs [81.01082659623552]
Large Language Models (LLMs) have demonstrated remarkable success across various domains. Their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit.
arXiv Detail & Related papers (2025-02-24T18:42:19Z)
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System [75.25394449773052]
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving. Yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness.
arXiv Detail & Related papers (2024-10-10T17:00:06Z)
A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms. FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning. We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
arXiv Detail & Related papers (2024-08-07T12:33:46Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method for large language models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
Memory-Efficient Optimization with Factorized Hamiltonian Descent [11.01832755213396]
We introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge. By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees.
arXiv Detail & Related papers (2024-06-14T12:05:17Z)
Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning. Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques. Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
arXiv Detail & Related papers (2024-02-18T14:08:48Z)
AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models. AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
CAME: Confidence-guided Adaptive Memory Efficient Optimization [20.009302737137787]
Adaptive gradient methods have demonstrated excellent performance in the training of large language models. However, maintaining second-moment estimates incurs a high extra memory overhead. Several memory-efficient optimizers have been proposed to obtain a drastic reduction in auxiliary memory usage, but with a performance penalty. We propose CAME to simultaneously achieve two goals: fast convergence as in traditional adaptive methods, and low memory usage as in memory-efficient methods.
arXiv Detail & Related papers (2023-07-05T06:05:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.