Related papers: A Proximal Operator for Inducing 2:4-Sparsity

A Proximal Operator for Inducing 2:4-Sparsity

URL: http://arxiv.org/abs/2501.18015v1
Date: Wed, 29 Jan 2025 22:05:17 GMT
Title: A Proximal Operator for Inducing 2:4-Sparsity
Authors: Jonas M Kübler, Yu-Xiang Wang, Shoham Sabach, Navid Ansari, Matthäus Kleindessner, Kailash Budhathoki, Volkan Cevher, George Karypis,
Abstract summary: We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models.<n>We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters.
Score: 68.98036844970986
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent hardware advancements in AI Accelerators and GPUs allow to efficiently compute sparse matrix multiplications, especially when 2 out of 4 consecutive weights are set to zero. However, this so-called 2:4 sparsity usually comes at a decreased accuracy of the model. We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models. We minimize the regularizer jointly with a local squared loss by deriving the proximal operator for which we show that it has an efficient solution in the 2:4-sparse case. After optimizing the mask, we use maskedgradient updates to further minimize the local squared loss. We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters. On models up to 13B we improve over previous state of the art algorithms, whilst on 70B models we match their performance.

Related papers

1000+ FPS 4D Gaussian Splatting for Dynamic Scene Rendering [60.676919690136096]
We present textbf4DGS-1K, which runs at over 1000 FPS on modern scene GPUs. For Q1, we introduce the Spatial-Temporal Variation Score, a new pruning criterion that effectively removes short-lifespan Gaussians. For Q2, we store a mask for active Gaussians across consecutive frames, significantly reducing redundant computations in rendering.
arXiv Detail & Related papers (2025-03-20T17:59:44Z)
Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training [3.195234044113248]
We propose textscNeuroAL, a emphtop-up algorithm for network pruning.<n>It modifies the block-wise and row-wise sparsity exploiting information from both the dense model and its sparse version.<n>It consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off.
arXiv Detail & Related papers (2024-11-11T15:30:16Z)
Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization [0.0]
We present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices. Our method achieves state-of-the-art results, enabling unprecedented sparsification of neural networks.
arXiv Detail & Related papers (2024-09-27T15:48:39Z)
S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training [20.113352600259226]
We propose S-STE, a simple yet powerful 2:4 training method that contains two parts: to continuously project weights to be 2:4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor.<n>Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full parameter models.
arXiv Detail & Related papers (2024-09-13T08:29:36Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs. We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
SliceGPT: Compress Large Language Models by Deleting Rows and Columns [27.004657436024853]
We present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. We show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance.
arXiv Detail & Related papers (2024-01-26T17:35:45Z)
QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models [57.04178959678024]
We show that the majority of inference computations for large generative models can be performed with both weights and activations being cast to 4 bits. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit. We provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.4x.
arXiv Detail & Related papers (2023-10-13T17:15:05Z)
Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones. We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
Minimum Variance Unbiased N:M Sparsity for the Neural Gradients [29.555643722721882]
In deep learning, fine-grained N:M sparsity reduces the data footprint and bandwidth of a General Matrix multiply (GEMM) up to x2. We examine how this method can be used also for the neural gradients.
arXiv Detail & Related papers (2022-03-21T13:59:43Z)
Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements. We derive an information-theoretic lower bound for the minimax risk under this setting. We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity [12.643043455369297]
We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. We implement and evaluate the sparsity pattern on GPU tensor core, achieving a 1.95x speedup over the dense model.
arXiv Detail & Related papers (2020-08-29T16:27:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.