InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer
- URL: http://arxiv.org/abs/2503.15983v1
- Date: Thu, 20 Mar 2025 09:30:35 GMT
- Title: InhibiDistilbert: Knowledge Distillation for a ReLU and Addition-based Transformer
- Authors: Tony Zhang, Rickard Brännvall
- Abstract summary: Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture.
- Score: 2.3020018305241337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work explores optimizing transformer-based language models by integrating model compression techniques with inhibitor attention, a novel alternative attention mechanism. Inhibitor attention employs Manhattan distances and ReLU activations instead of the matrix multiplications and softmax activation of the conventional scaled dot-product attention. This shift offers potential computational and energy savings while maintaining model effectiveness. We propose further adjustments to improve the inhibitor mechanism's training efficiency and evaluate its performance on the DistilBERT architecture. Our knowledge distillation experiments indicate that the modified inhibitor transformer model can achieve competitive performance on standard NLP benchmarks, including General Language Understanding Evaluation (GLUE) and sentiment analysis tasks.
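For concreteness, the following is a minimal NumPy sketch of what a Manhattan-distance, ReLU-only attention head can look like. The scaling, the mean aggregation over keys, and the function name are illustrative assumptions; the paper's exact inhibitor formulation differs in its details.

```python
import numpy as np

def inhibitor_attention(Q, K, V):
    """Sketch of an addition/ReLU-based attention head (assumptions, not the
    paper's exact formulation): scores are scaled Manhattan distances between
    queries and keys, and values are aggregated by subtracting that distance
    ("inhibition") and clipping at zero instead of a softmax-weighted sum."""
    d = Q.shape[-1]
    # Pairwise Manhattan distances, Z[i, j] = sum_k |Q[i, k] - K[j, k]| / sqrt(d)
    Z = np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1) / np.sqrt(d)
    # Each value coordinate is inhibited by the distance, passed through ReLU,
    # and averaged over keys -- additions, subtractions, and ReLU only.
    return np.maximum(V[None, :, :] - Z[:, :, None], 0.0).mean(axis=1)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(inhibitor_attention(Q, K, V).shape)  # (4, 8)
```

Because the core operations are subtractions, absolute values, and ReLU rather than exponentials and large matrix products, this arithmetic profile is what underlies the computational and energy savings discussed in the abstract.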
Related papers
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment.
We define this phenomenon as model hemorrhage - performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks.
While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers.
We propose Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of the associated NLM smoothing.
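A minimal sketch of the twicing idea may help: in nonparametric regression, twicing replaces a smoother S with 2S - S^2, feeding the smoothed residual back in to recover detail lost to over-smoothing. Applied to a self-attention matrix A this reads (2A - A^2)V; the paper's actual construction may include projections and other details omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def twicing_attention(X):
    """Sketch of self-attention with a kernel-twicing correction:
    output = (2A - A^2) V instead of the usual A V."""
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d))  # standard (square) self-attention weights
    AV = A @ X                         # using X itself as values for brevity
    return 2.0 * AV - A @ AV           # twiced output, restores some high-frequency detail

rng = np.random.default_rng(1)
print(twicing_attention(rng.normal(size=(6, 16))).shape)  # (6, 16)
```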
arXiv Detail & Related papers (2025-03-02T01:56:35Z) - DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge. Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
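The routing described above can be sketched roughly as follows; the block partitioning, gate threshold, and names are assumptions, and during training the hard selection would be relaxed with a straight-through estimator so gradients flow through the sigmoid.

```python
import numpy as np

def dsmoe_layer(x, W_in, W_out, W_gate, threshold=0.5):
    """Rough sketch of a DSMoE-style layer: a pre-trained FFN split into
    expert blocks, each activated per token through a sigmoid gate."""
    n_experts = W_in.shape[0]
    gates = 1.0 / (1.0 + np.exp(-(x @ W_gate)))     # (tokens, n_experts), in (0, 1)
    active = gates > threshold                       # hard routing decision (inference)
    out = np.zeros_like(x)
    for e in range(n_experts):
        mask = active[:, e]
        if not mask.any():
            continue
        h = np.maximum(x[mask] @ W_in[e], 0.0)       # expert block = slice of the FFN
        out[mask] += gates[mask, e:e+1] * (h @ W_out[e])
    return out

rng = np.random.default_rng(2)
d, d_ff, n_exp = 32, 128, 4
x = rng.normal(size=(10, d))
W_in = rng.normal(size=(n_exp, d, d_ff // n_exp))
W_out = rng.normal(size=(n_exp, d_ff // n_exp, d))
W_gate = rng.normal(size=(d, n_exp))
print(dsmoe_layer(x, W_in, W_out, W_gate).shape)     # (10, 32)
```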
arXiv Detail & Related papers (2025-02-18T02:37:26Z) - LORTSAR: Low-Rank Transformer for Skeleton-based Action Recognition [4.375744277719009]
LORTSAR is applied to two leading Transformer-based models, "Hyperformer" and "STEP-CATFormer".
Our method substantially reduces the number of model parameters with negligible degradation, or even an increase, in recognition accuracy.
This confirms that SVD combined with post-compression fine-tuning can boost model efficiency, paving the way for more sustainable, lightweight, and high-performance technologies in human action recognition.
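The core compression step is plain SVD truncation of individual weight matrices, as in the sketch below; the function name, the fixed rank, and the choice of matrix are illustrative, and a short fine-tuning pass would normally follow to recover accuracy.

```python
import numpy as np

def low_rank_compress(W, rank):
    """Sketch of SVD-based low-rank compression: W (m x n) is replaced by
    factors A (m x r) and B (r x n), cutting parameters from m*n to r*(m+n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(3)
W = rng.normal(size=(768, 3072))
A, B = low_rank_compress(W, rank=128)
print(A.shape, B.shape)                               # (768, 128) (128, 3072)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative reconstruction error
```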
arXiv Detail & Related papers (2024-07-19T20:19:41Z) - EEGEncoder: Advancing BCI with Transformer-Based Motor Imagery Classification [11.687193535939798]
Brain-computer interfaces (BCIs) harness electroencephalographic signals for direct neural control of devices.
Traditional machine learning methods for EEG-based motor imagery (MI) classification encounter challenges such as manual feature extraction and susceptibility to noise.
This paper introduces EEGEncoder, a deep learning framework that employs modified transformers and TCNs to surmount these limitations.
arXiv Detail & Related papers (2024-04-23T09:51:24Z) - Adaptive Catalyst Discovery Using Multicriteria Bayesian Optimization with Representation Learning [17.00084254889438]
High-performance catalysts are crucial for sustainable energy conversion and human health.
The discovery of catalysts faces challenges due to the absence of efficient approaches to navigating vast and high-dimensional structure and composition spaces.
arXiv Detail & Related papers (2024-04-18T18:11:06Z) - Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion [4.716845031095804]
Transformer models can face practical limitations due to their high computational requirements.
Such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers.
We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model.
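One hedged reading of the dynamic-k conversion is sketched below: the neurons of a ReLU FFN are grouped into expert blocks, and each token evaluates only the experts whose routing probability clears a threshold, so the number of active experts varies per token. The names and the threshold rule are assumptions, not the paper's exact scheme.

```python
import numpy as np

def dynamic_k_moe(x, W_in, W_out, router, tau=0.2):
    """Sketch of per-token dynamic-k expert selection after converting a dense
    ReLU FFN into expert blocks (routing details are assumed)."""
    scores = x @ router                                        # (tokens, n_experts)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in np.flatnonzero(probs[t] >= tau):              # dynamic k per token
            h = np.maximum(x[t] @ W_in[e], 0.0)                # expert = group of FFN neurons
            out[t] += h @ W_out[e]
    return out

rng = np.random.default_rng(4)
d, n_exp, d_block = 32, 4, 16
x = rng.normal(size=(8, d))
out = dynamic_k_moe(x, rng.normal(size=(n_exp, d, d_block)),
                    rng.normal(size=(n_exp, d_block, d)), rng.normal(size=(d, n_exp)))
print(out.shape)  # (8, 32)
```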
arXiv Detail & Related papers (2023-10-06T16:34:51Z) - The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers [0.0]
We replace the dot-product and Softmax-based attention with an alternative mechanism involving addition and ReLU activation only.
This side-steps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations.
It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems like homomorphic encryption.
arXiv Detail & Related papers (2023-10-03T13:34:21Z) - Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning [1.1510009152620668]
This project investigates and applies knowledge distillation for BERT model compression.
We explore various techniques to improve knowledge distillation, including experimentation with loss functions, transformer layer mapping methods, and tuning the weights of attention and representation loss.
The goal of this work is to improve the efficiency and effectiveness of knowledge distillation, enabling the development of more efficient and accurate models for a range of natural language processing tasks.
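As a concrete (assumed) example of the kind of objective being tuned, the sketch below combines a temperature-scaled soft-label term with an MSE term on mapped hidden states; the weights, temperature, and the choice of MSE for the representation loss are hypothetical, and an analogous term over attention maps would be added in the same way.

```python
import numpy as np

def softmax_T(z, T=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      T=2.0, w_soft=1.0, w_rep=1.0):
    """Sketch of a weighted distillation objective: soft cross-entropy between
    temperature-scaled teacher and student logits plus an MSE loss between
    (already layer-mapped) hidden representations."""
    p_teacher = softmax_T(teacher_logits, T)
    log_p_student = np.log(softmax_T(student_logits, T) + 1e-12)
    soft_ce = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    rep_mse = np.mean((student_hidden - teacher_hidden) ** 2)
    return w_soft * soft_ce + w_rep * rep_mse

rng = np.random.default_rng(5)
loss = distillation_loss(rng.normal(size=(16, 2)), rng.normal(size=(16, 2)),
                         rng.normal(size=(16, 768)), rng.normal(size=(16, 768)))
print(float(loss))
```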
arXiv Detail & Related papers (2023-08-26T20:59:21Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
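For reference, $\sigma$Reparam is commonly described as rescaling each weight matrix by its spectral norm together with a learned scalar; the sketch below uses a full norm computation and a plain float for the scalar, whereas in practice the norm is tracked with power iteration and the scalar is a trainable parameter.

```python
import numpy as np

def sigma_reparam(W, gamma=1.0):
    """Sketch of the sigma-Reparam reparameterization: W_hat = (gamma / sigma(W)) * W,
    which bounds the growth of attention logits and helps keep attention entropy
    from collapsing during training."""
    spectral_norm = np.linalg.norm(W, 2)   # largest singular value of W
    return (gamma / spectral_norm) * W

rng = np.random.default_rng(6)
W_hat = sigma_reparam(rng.normal(size=(64, 64)), gamma=1.0)
print(np.linalg.norm(W_hat, 2))            # ~1.0 by construction
```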
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z) - Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior [54.629850694790036]
Spectral-normalized identity priors (SNIP) is a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping.
We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance.
arXiv Detail & Related papers (2020-10-05T05:40:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.