NxMTransformer: Semi-Structured Sparsification for Natural Language
Understanding via ADMM
- URL: http://arxiv.org/abs/2110.15766v1
- Date: Thu, 28 Oct 2021 17:43:06 GMT
- Title: NxMTransformer: Semi-Structured Sparsification for Natural Language
Understanding via ADMM
- Authors: Connor Holmes, Minjia Zhang, Yuxiong He, and Bo Wu
- Abstract summary: We introduce a new learning framework, called NxMTransformer, to induce NxM semi-structured sparsity on pretrained language models.
We propose to formulate the NxM sparsity as a constrained optimization problem and use Alternating Direction Method of Multipliers (ADMM) to optimize the downstream tasks.
Our proposed method is able to achieve 1.7 points higher accuracy in GLUE score than current practices.
- Score: 16.464030458567187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Processing (NLP) has recently achieved success by using huge
pre-trained Transformer networks. However, these models often contain hundreds
of millions or even billions of parameters, bringing challenges to online
deployment due to latency constraints. Recently, hardware manufacturers have
introduced dedicated hardware for NxM sparsity to provide the flexibility of
unstructured pruning with the runtime efficiency of structured approaches. NxM
sparsity permits arbitrarily selecting M parameters to retain from a contiguous
group of N in the dense representation. However, due to the extremely high
complexity of pre-trained models, the standard sparse fine-tuning techniques
often fail to generalize well on downstream tasks, which have limited data
resources. To address such an issue in a principled manner, we introduce a new
learning framework, called NxMTransformer, to induce NxM semi-structured
sparsity on pretrained language models for natural language understanding to
obtain better performance. In particular, we propose to formulate the NxM
sparsity as a constrained optimization problem and use Alternating Direction
Method of Multipliers (ADMM) to optimize the downstream tasks while taking the
underlying hardware constraints into consideration. ADMM decomposes the NxM
sparsification problem into two sub-problems that can be solved sequentially,
generating sparsified Transformer networks that achieve high accuracy while
being able to effectively execute on newly released hardware. We apply our
approach to a wide range of NLP tasks, and our proposed method is able to
achieve 1.7 points higher accuracy in GLUE score than current practices.
Moreover, we perform detailed analysis on our approach and shed light on how
ADMM affects fine-tuning accuracy for downstream tasks. Finally, we illustrate
how NxMTransformer achieves performance improvement with knowledge
distillation.
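To make the NxM constraint described in the abstract concrete (for example, the 2:4 pattern supported by recent sparse tensor core hardware), the following is a minimal sketch, assuming PyTorch, of projecting a dense weight matrix onto an NxM-sparse one by keeping the M largest-magnitude entries in every contiguous group of N. The function name and the magnitude criterion are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def project_nxm(weight: torch.Tensor, n: int = 4, m: int = 2) -> torch.Tensor:
    """Keep the m largest-magnitude entries in every contiguous group of n.

    Minimal sketch of NxM semi-structured sparsity: each row is split
    into groups of n consecutive weights, and only m of them are kept
    (chosen here by magnitude, an illustrative criterion).
    """
    rows, cols = weight.shape
    assert cols % n == 0, "columns must be divisible by the group size n"
    groups = weight.reshape(rows, cols // n, n)        # (rows, groups, n)
    # Indices of the m largest-magnitude weights within each group.
    topk = groups.abs().topk(m, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return (groups * mask).reshape(rows, cols)

# Example: a 2x8 weight matrix projected onto the 4:2 pattern.
w = torch.randn(2, 8)
w_sparse = project_nxm(w, n=4, m=2)
print((w_sparse != 0).sum(dim=-1))  # 4 nonzeros per row (2 per group of 4)
```

Because the groups are contiguous and fixed-size, the resulting layout is what lets the dedicated hardware skip the pruned weights at runtime while still allowing an arbitrary choice of which M weights survive within each group.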
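The ADMM decomposition mentioned in the abstract can be sketched as an alternation between (1) a gradient-based update of the dense weights, augmented with a quadratic penalty pulling them toward an NxM-sparse auxiliary copy, and (2) a projection of that copy onto the hardware constraint set, followed by a dual update. The loop below is a simplified single-matrix sketch under those assumptions; `project_nxm` is the hypothetical helper from the previous snippet, and `rho`, `lr`, and the update schedule are illustrative rather than the paper's exact hyperparameters.

```python
import torch

def admm_sparsify(weight: torch.Tensor,
                  task_loss_fn,          # callable: weight -> scalar loss
                  steps: int = 100,
                  rho: float = 1e-3,
                  lr: float = 1e-4,
                  n: int = 4, m: int = 2) -> torch.Tensor:
    """Sketch of ADMM-style NxM sparsification for a single weight matrix.

    Sub-problem 1: update W by gradient descent on
        task_loss(W) + (rho / 2) * ||W - Z + U||^2
    Sub-problem 2: Z = Pi_S(W + U), projection onto NxM-sparse matrices
    Dual update:   U = U + W - Z
    """
    w = weight.detach().clone().requires_grad_(True)
    z = project_nxm(w.detach(), n, m)      # sparse auxiliary variable
    u = torch.zeros_like(w)                # scaled dual variable

    for _ in range(steps):
        # Sub-problem 1: fine-tuning step on the dense weights with penalty.
        loss = task_loss_fn(w) + 0.5 * rho * (w - z + u).pow(2).sum()
        loss.backward()
        with torch.no_grad():
            w -= lr * w.grad
        w.grad = None
        # Sub-problem 2 and dual update.
        with torch.no_grad():
            z = project_nxm(w + u, n, m)
            u = u + w - z

    # Final hard projection so the result satisfies the NxM constraint exactly.
    return project_nxm(w.detach(), n, m)
```

The two sub-problems can be solved sequentially because the Euclidean projection onto the NxM-sparse set has the closed form used in `project_nxm` (keep the M largest-magnitude entries per group), while the weight update remains an ordinary fine-tuning step on the downstream task.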
Related papers
- Progressive Mixed-Precision Decoding for Efficient LLM Inference [49.05448842542558]
We introduce Progressive Mixed-Precision Decoding (PMPD) to address the memory-boundedness of decoding.
PMPD achieves a 1.4x-12.2x speedup in matrix-vector multiplications over fp16 models.
Our approach delivers a throughput gain of 3.8x-8.0x over fp16 models and up to 1.54x over uniform quantization approaches.
arXiv Detail & Related papers (2024-10-17T11:46:33Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - N-1 Reduced Optimal Power Flow Using Augmented Hierarchical Graph Neural
Network [0.2900810893770134]
AHGNN-enabled N-1 ROPF can result in a remarkable reduction in computing time while retaining the solution quality.
Case studies prove the proposed AHGNN and the associated N-1 ROPF are highly effective in reducing computation time while preserving solution quality.
arXiv Detail & Related papers (2024-02-09T07:23:27Z) - Dynamically configured physics-informed neural network in topology
optimization applications [4.403140515138818]
The physics-informed neural network (PINN) can avoid generating enormous amounts of data when solving forward problems.
A dynamically configured PINN-based topology optimization (DCPINN-TO) method is proposed.
The accuracy of the displacement prediction and optimization results indicate that the DCPINN-TO method is effective and efficient.
arXiv Detail & Related papers (2023-12-12T05:35:30Z) - E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity [6.434967516411846]
We introduce the information entropy of hidden state features into a pruning metric design, namely E-Sparse.
E-Sparse employs the information richness to leverage the channel importance, and further incorporates several novel techniques to put it into effect.
E-Sparse can significantly speed up the model inference over the dense model (up to 1.53X) and obtain significant memory saving (up to 43.52%), with acceptable accuracy loss.
arXiv Detail & Related papers (2023-10-24T15:27:15Z) - Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers [66.823588073584]
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications.
Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs.
We propose a neural bandit algorithm which replaces the Gaussian process (GP) in BO with a neural network surrogate to optimize instructions for black-box LLMs.
arXiv Detail & Related papers (2023-10-02T02:01:16Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Learning to Solve the AC-OPF using Sensitivity-Informed Deep Neural
Networks [52.32646357164739]
We propose a sensitivity-informed deep neural network (SIDNN) to approximate solutions of the AC optimal power flow (AC-OPF) problem.
The proposed SIDNN is compatible with a broad range of OPF schemes.
It can be seamlessly integrated in other learning-to-OPF schemes.
arXiv Detail & Related papers (2021-03-27T00:45:23Z) - Joint Deep Reinforcement Learning and Unfolding: Beam Selection and
Precoding for mmWave Multiuser MIMO with Lens Arrays [54.43962058166702]
Millimeter wave (mmWave) multiuser multiple-input multiple-output (MU-MIMO) systems with discrete lens arrays (DLA) have received great attention.
In this work, we investigate the joint design of a beam precoding matrix for mmWave MU-MIMO systems with DLA.
arXiv Detail & Related papers (2021-01-05T03:55:04Z) - Deep unfolding of the weighted MMSE beamforming algorithm [9.518010235273783]
We propose the novel application of deep unfolding to the WMMSE algorithm for a MISO downlink channel.
Deep unfolding naturally incorporates expert knowledge, with the benefits of immediate and well-grounded architecture selection, fewer trainable parameters, and better explainability.
By means of simulations, we show that, in most of the settings, the unfolded WMMSE outperforms or performs equally to the WMMSE for a fixed number of iterations.
arXiv Detail & Related papers (2020-06-15T14:51:20Z)