To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training
- URL: http://arxiv.org/abs/2602.06183v1
- Date: Thu, 05 Feb 2026 20:43:24 GMT
- Title: To 2:4 Sparsity and Beyond: Neuron-level Activation Function to Accelerate LLM Pre-Training
- Authors: Meghana Madhyastha, Daniel Haziza, Jesse Cai, Newsha Ardalani, Zhiqi Bu, Carole-Jean Wu,
- Abstract summary: We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the Feed Forward Network (FFN). Our recipe relies on sparse training steps to accelerate a large part of the pretraining, followed by regular dense training steps towards the end. Models trained with this approach exhibit the same performance on our quality benchmarks, and can speed up training end-to-end by 1.4 to 1.7x.
- Score: 17.090117647151708
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training of Large Language Models is generally bottlenecked by matrix multiplications. In the Transformer architecture, a large portion of these operations happens in the Feed Forward Network (FFN), and this portion increases for larger models, reaching up to 50% of the total pretraining floating point operations. We show that we can leverage hardware-accelerated sparsity to accelerate all matrix multiplications in the FFN, with 2:4 sparsity for weights and v:n:m (Venom) sparsity for activations. Our recipe relies on sparse training steps to accelerate a large part of the pretraining, followed by regular dense training steps towards the end. Overall, models trained with this approach exhibit the same performance on our quality benchmarks, and can speed up training end-to-end by 1.4 to 1.7x. This approach is applicable to all NVIDIA GPUs starting with the A100 generation, is orthogonal to common optimization techniques such as quantization, and can also be applied to mixture-of-experts model architectures.
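As a minimal illustration of the 2:4 pattern the recipe relies on (a sketch of the general technique, not the paper's implementation): hardware-accelerated 2:4 sparsity keeps at most two nonzero values in every contiguous group of four, which can be emulated with magnitude pruning:

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Magnitude-prune to the 2:4 pattern: in every contiguous group of
    four values, zero the two with the smallest magnitude.  Assumes the
    total number of elements is divisible by 4."""
    groups = w.reshape(-1, 4).copy()
    # indices of the two smallest-magnitude entries in each group
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)
```

Sparse tensor cores can then skip the zeroed half of each group; the paper's recipe additionally applies v:n:m (Venom) sparsity to activations and switches back to dense steps near the end of pretraining.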
Related papers
- DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. Existing methods suffer from computational inefficiency and objective mismatches between training and inference. We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z) - NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium [4.7520621855466425]
We design high-performance matmul, a critical compute kernel, for LLM inference on Trainium. We show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium.
arXiv Detail & Related papers (2025-10-29T21:22:08Z) - Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models [49.911784762244814]
TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models (DLMs). We derive a series of state-of-the-art diffusion language models, namely TraDo. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-09-08T17:58:06Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale [68.6602625868888]
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. Operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression. We train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous-generation hybrids.
arXiv Detail & Related papers (2025-02-25T19:47:20Z) - Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers [16.253898272659242]
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive.
Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs)
We show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off.
arXiv Detail & Related papers (2024-06-24T08:43:21Z) - Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z) - Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z) - QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits.
We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit.
We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
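A generic symmetric 4-bit quantizer (an illustrative sketch, not QUIK's actual hybrid scheme, whose details are in the paper) shows the round-trip that such kernels perform:

```python
import numpy as np

def quantize_int4(x: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: map floats onto the
    integer grid [-8, 7].  Assumes x contains at least one nonzero."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    # recover approximate float values from the 4-bit codes
    return q.astype(np.float32) * scale
```

Quantizing both weights and activations this way lets matmuls run on low-precision integer units; the per-tensor error is bounded by half a quantization step.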
arXiv Detail & Related papers (2023-10-13T17:15:05Z) - Multiplication-Free Transformer Training via Piecewise Affine Operations [44.99157696237478]
We replace multiplication with a cheap piecewise affine approximation that is achieved by adding the bit representation of the floating point numbers together as integers.
We show that transformers can be trained with the resulting modified matrix multiplications on both vision and language tasks with little to no performance impact.
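The bit-addition trick can be sketched for scalar float32 values (a hedged illustration of the idea for positive finite inputs; the paper applies it to full matrix multiplications):

```python
import struct

BIAS = 127 << 23  # float32 exponent bias, aligned with the exponent bits

def f2i(x: float) -> int:
    # raw IEEE-754 bit pattern of a float32, as an unsigned integer
    return struct.unpack("<I", struct.pack("<f", x))[0]

def i2f(i: int) -> float:
    return struct.unpack("<f", struct.pack("<I", i))[0]

def approx_mul(x: float, y: float) -> float:
    """Approximate x*y by adding the raw float32 bit patterns as
    integers (minus the bias).  The bit pattern of a positive float is
    a piecewise affine function of log2(x), so integer addition acts
    like adding logarithms.  Valid only for positive finite inputs."""
    return i2f(f2i(x) + f2i(y) - BIAS)
```

The approximation is exact when the mantissa fractions do not carry, and its relative error stays within roughly 12% otherwise, which is the "little to no performance impact" regime the paper reports.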
arXiv Detail & Related papers (2023-05-26T18:28:28Z) - A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs [1.7481226034111275]
This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training.
AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%.
It achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.
arXiv Detail & Related papers (2023-05-22T22:41:49Z) - DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining [22.168137965177284]
Communication scheduling has been shown to be effective in accelerating distributed training.
We propose a novel scheduling algorithm, DeAR, that decouples the all-reduce primitive into two continuous operations.
We show that DeAR achieves up to 83% and 15% training speedup over the state-of-the-art solutions.
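The decoupling can be pictured with a toy single-process simulation, assuming the two continuous operations are reduce-scatter and all-gather (the standard decomposition of all-reduce; DeAR's contribution is how it schedules and pipelines them, which this sketch does not model):

```python
import numpy as np

def allreduce_decoupled(shards):
    """Simulate an all-reduce over len(shards) workers as a
    reduce-scatter followed by an all-gather, the two primitives a
    scheduler can then overlap with computation independently."""
    n = len(shards)
    chunks = [np.array_split(s, n) for s in shards]
    # reduce-scatter: worker i ends up owning the sum of chunk i
    reduced = [sum(chunks[w][i] for w in range(n)) for i in range(n)]
    # all-gather: every worker receives all reduced chunks
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(n)]
```

Because each half finishes before the other starts, backward-pass gradients can overlap with one primitive while the other is still in flight.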
arXiv Detail & Related papers (2023-02-24T04:11:18Z) - LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning [2.0625936401496237]
Multi-agent reinforcement learning (MARL) is a powerful technology to construct interactive artificial intelligent systems.
We present a real-time sparse training acceleration system named LearningGroup.
Our system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively.
arXiv Detail & Related papers (2022-10-29T15:09:34Z) - Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs [15.465530153038927]
Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies.
However, training these models involving large parameters is both time-consuming and energy-hogging.
arXiv Detail & Related papers (2021-09-16T04:12:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.