Related papers: Lizard: An Efficient Linearization Framework for Large Language Models

Lizard: An Efficient Linearization Framework for Large Language Models

URL: http://arxiv.org/abs/2507.09025v3
Date: Thu, 09 Oct 2025 20:37:43 GMT
Title: Lizard: An Efficient Linearization Framework for Large Language Models
Authors: Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen,
Abstract summary: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures.<n>Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality.<n>We show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by up to 9.4 - 24.5 points on the 5-shot MMLU benchmark.
Score: 113.87302474262798
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers faces severe computational and memory bottlenecks with long sequences due to the quadratic complexity of softmax attention and the growing Key-Value (KV) cache that makes inference memory-bound by context length. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving model quality. Unlike prior linearization methods constrained by fixed, non-adaptive structures, Lizard augments the architecture with compact, learnable modules that enable adaptive memory control and robust length generalization. Moreover, we introduce a hardwareaware algorithm that solves numerical instability in gated attention to accelerate training. Extensive experiments show that Lizard achieves near-lossless recovery of its teacher model's performance, significantly outperforming previous methods by up to 9.4 - 24.5 points on the 5-shot MMLU benchmark and demonstrating superior associative recall.

Related papers

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation [57.57816409869894]
We introduce POET-X, a scalable and memory-efficient variant for training large language models.<n>PoET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency.
arXiv Detail & Related papers (2026-03-05T18:59:23Z)
Memory Caching: RNNs with Growing Memory [56.25483647131372]
We introduce Memory Caching (MC), a technique that enhances recurrent models by caching checkpoints of memory states (a.k.a. hidden states)<n>We propose four variants of MC, including gated aggregation and sparse selective mechanisms, and discuss their implications on both linear and deep memory modules.<n>The results indicate that while Transformers achieve the best accuracy, our MC variants show competitive performance, close the gap with Transformers, and performs better than state-of-the-art recurrent models.
arXiv Detail & Related papers (2026-02-27T18:53:41Z)
AllMem: A Memory-centric Recipe for Efficient Long-context Modeling [32.025154452526856]
Large Language Models (LLMs) encounter significant performance bottlenecks in long-sequence tasks.<n>We introduce textscAllMem, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks.
arXiv Detail & Related papers (2026-02-14T09:04:28Z)
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training [67.45211108321203]
We introduce a numerically stable, chunkwise parallelizable version of the recently proposed Mesa layer.<n>We show that optimal test-time training enables reaching lower language modeling perplexity and higher downstream benchmark performance than previous RNNs.
arXiv Detail & Related papers (2025-06-05T16:50:23Z)
Probing In-Context Learning: Impact of Task Complexity and Model Architecture on Generalization and Efficiency [10.942999793311765]
We investigate in-context learning (ICL) through a meticulous experimental framework that systematically varies task complexity and model architecture.<n>We evaluate four distinct models: a GPT2-style Transformer, a Transformer with FlashAttention mechanism, a convolutional Hyena-based model, and the Mamba state-space model.
arXiv Detail & Related papers (2025-05-10T00:22:40Z)
Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
HyLiFormer: Hyperbolic Linear Attention for Skeleton-based Human Action Recognition [20.45747733568704]
We propose HyLiFormer, a novel hyperbolic linear attention Transformer tailored for skeleton-based action recognition.<n>Our approach incorporates a Hyperbolic Transformation with Curvatures (HTC) module to map skeleton data into hyperbolic space and a Hyperbolic Linear Attention (HLA) module for efficient long-range dependency modeling.
arXiv Detail & Related papers (2025-02-09T12:08:03Z)
ReGLA: Refining Gated Linear Attention [42.97193398172823]
Linear attention has been designed to reduce the quadratic space-time complexity that is inherent in standard transformers.<n>We developed a feature mapping function to address some crucial issues that previous suggestions overlooked.<n>We also explored the saturation phenomenon of the gating mechanism and augmented it with a refining module.
arXiv Detail & Related papers (2025-02-03T18:03:13Z)
Byte Latent Transformer: Patches Scale Better Than Tokens [101.10994909832063]
Byte Latent Transformer (BLT) encodes bytes into dynamically sized patches, which serve as the primary units of computation.<n>For fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
arXiv Detail & Related papers (2024-12-13T05:33:32Z)
Longhorn: State Space Models are Amortized Online Learners [51.10124201221601]
State-space models (SSMs) offer linear decoding efficiency while maintaining parallelism during training. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. We introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem.
arXiv Detail & Related papers (2024-07-19T11:12:08Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences. We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook. LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
PolySketchFormer: Fast Transformers via Sketching Polynomial Kernels [23.99075223506133]
We show that attention with high degree can effectively replace softmax without sacrificing model quality. We present a block-based algorithm to apply causal masking efficiently. We validate PolySketchFormerAttention empirically by training language models capable of handling long contexts.
arXiv Detail & Related papers (2023-10-02T21:39:04Z)
TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer [34.790081960470964]
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) We make advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization. We validate our model design through a series of ablations and train models with sizes of 385M, 1B, and 7B on our self-collected corpus.
arXiv Detail & Related papers (2023-07-27T16:45:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.