Liger: Linearizing Large Language Models to Gated Recurrent Structures
- URL: http://arxiv.org/abs/2503.01496v1
- Date: Mon, 03 Mar 2025 13:08:00 GMT
- Title: Liger: Linearizing Large Language Models to Gated Recurrent Structures
- Authors: Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng,
- Abstract summary: Linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures.<n>Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters.
- Score: 9.665802842933209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which significantly recovers 93\% of the Transformer-based LLM at 0.02\% pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
Related papers
- AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation.<n>Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads.<n>We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single- GPU and multi- GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - In-Context Learning of Polynomial Kernel Regression in Transformers with GLU Layers [19.606993863000906]
Transformer-based models have demonstrated remarkable ability in in-context learning (ICL)<n>Recent research has provided insight into how linear Transformers can perform ICL by implementing gradient descent estimators.<n>We study a mechanism that combines LSA with GLU-like feed-forward layers and show that this allows the model to perform one step of gradient descent on a kernel regression.
arXiv Detail & Related papers (2025-01-30T07:41:20Z) - Transformer Block Coupling and its Correlation with Generalization in LLMs [3.007031501305338]
Large Language Models (LLMs) have made significant strides in natural language processing.<n>We trace the trajectories of individual tokens as they pass through transformer blocks, and linearize the system along these trajectories through their Jacobian matrices.<n>We uncover a $textbftransformer block coupling$ phenomenon in a variety of LLMs, characterized by the coupling of their top singular vectors across tokens and depth.
arXiv Detail & Related papers (2024-07-10T16:30:27Z) - Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs)
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
arXiv Detail & Related papers (2024-06-24T08:43:21Z) - Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% tokens compared to original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z) - Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z) - Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach to fine-tune large language models (LLMs)
Inspired by the Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs.
Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient training-free manner and open new venues to scale the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z) - How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z) - Learning to Grow Pretrained Models for Efficient Transformer Training [72.20676008625641]
We learn to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model.
Experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch.
arXiv Detail & Related papers (2023-03-02T05:21:18Z) - Bootstrapped Transformer for Offline Reinforcement Learning [31.43012728924881]
offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment.
Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem.
We propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data.
arXiv Detail & Related papers (2022-06-17T05:57:47Z) - LQF: Linear Quadratic Fine-Tuning [114.3840147070712]
We present the first method for linearizing a pre-trained model that achieves comparable performance to non-linear fine-tuning.
LQF consists of simple modifications to the architecture, loss function and optimization typically used for classification.
arXiv Detail & Related papers (2020-12-21T06:40:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.