EELBERT: Tiny Models through Dynamic Embeddings
- URL: http://arxiv.org/abs/2310.20144v1
- Date: Tue, 31 Oct 2023 03:28:08 GMT
- Title: EELBERT: Tiny Models through Dynamic Embeddings
- Authors: Gabrielle Cohn, Rishika Agarwal, Deepanshu Gupta and Siddharth
Patwardhan
- Abstract summary: EELBERT is an approach for compression of transformer-based models (e.g., BERT).
Compression is achieved by replacing the input embedding layer of the model with dynamic, i.e. on-the-fly, embedding computations.
We develop our smallest model UNO-EELBERT, which achieves a GLUE score within 4% of fully trained BERT-tiny.
- Score: 0.28675177318965045
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce EELBERT, an approach for compression of transformer-based models
(e.g., BERT), with minimal impact on the accuracy of downstream tasks. This is
achieved by replacing the input embedding layer of the model with dynamic, i.e.
on-the-fly, embedding computations. Since the input embedding layer accounts
for a significant fraction of the model size, especially for the smaller BERT
variants, replacing this layer with an embedding computation function helps us
reduce the model size significantly. Empirical evaluation on the GLUE benchmark
shows that our BERT variants (EELBERT) suffer minimal regression compared to
the traditional BERT models. Through this approach, we are able to develop our
smallest model UNO-EELBERT, which achieves a GLUE score within 4% of fully
trained BERT-tiny, while being 15x smaller (1.2 MB) in size.
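The abstract describes replacing BERT's input embedding table with an on-the-fly embedding computation but does not spell that function out. Below is a minimal PyTorch sketch of one plausible instantiation, computing token embeddings from hashed character n-grams plus a small trainable projection; the n-gram sizes, hashing scheme, and dimensions are illustrative assumptions, not the paper's specification.

```python
import hashlib

import torch
import torch.nn as nn


def ngram_hash_embedding(token: str, dim: int = 128,
                         ngram_sizes=(2, 3), num_buckets: int = 2 ** 20) -> torch.Tensor:
    """Compute a token embedding on the fly instead of reading it from a
    learned |vocab| x dim embedding table (illustrative stand-in for
    EELBERT's dynamic embedding computation, not the paper's exact method)."""
    padded = f"#{token}#"  # mark word boundaries
    grams = [padded[i:i + n]
             for n in ngram_sizes
             for i in range(len(padded) - n + 1)]
    vectors = []
    for gram in grams:
        # Stable hash -> bucket id -> deterministic pseudo-random vector,
        # so no per-token parameters need to be stored.
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % num_buckets
        gen = torch.Generator().manual_seed(bucket)
        vectors.append(torch.randn(dim, generator=gen))
    return torch.stack(vectors).mean(dim=0) if vectors else torch.zeros(dim)


# The only trainable parameters are a small projection, so this "embedding
# layer" stores O(dim * dim) values instead of O(|vocab| * dim).
projection = nn.Linear(128, 128)
embedding = projection(ngram_hash_embedding("compression"))
print(embedding.shape)  # torch.Size([128])
```

In the full model, the output of such a function would feed the transformer encoder in place of the usual embedding lookup; how EELBERT parameterizes and trains this computation is detailed in the paper itself.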
Related papers
- Towards Building Efficient Sentence BERT Models using Layer Pruning [0.4915744683251151]
This study examines the effectiveness of layer pruning in creating efficient Sentence BERT (SBERT) models.
Our goal is to create smaller sentence embedding models that reduce complexity while maintaining strong embedding similarity.
arXiv Detail & Related papers (2024-09-21T15:10:06Z)
- Can persistent homology whiten Transformer-based black-box models? A case study on BERT compression [0.0]
We propose Optimus BERT Compression and Explainability (OBCE) to bring explainability to BERT models.
Our methodology can "whiten" BERT models by providing explainability for their neurons and reducing model size.
arXiv Detail & Related papers (2023-12-17T12:33:50Z)
- Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum input length of 512 tokens, which makes them non-trivial to apply in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows any pretrained BERT model to be fine-tuned for inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z)
- DPBERT: Efficient Inference for BERT based on Dynamic Planning [11.680840266488884]
Existing input-adaptive inference methods fail to take full advantage of the structure of BERT.
We propose Dynamic Planning in BERT, a novel fine-tuning strategy that can accelerate the inference process of BERT.
Our method reduces latency to 75% of the original while maintaining 98% accuracy, yielding a better accuracy-speed trade-off compared to state-of-the-art input-adaptive methods.
arXiv Detail & Related papers (2023-07-26T07:18:50Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency, high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z)
- BinaryBERT: Pushing the Limit of BERT Quantization [74.65543496761553]
We propose BinaryBERT, which pushes BERT quantization to the limit with weight binarization.
We find that a binary BERT is harder to train directly than a ternary counterpart due to its complex and irregular loss landscape.
Empirical results show that BinaryBERT has negligible performance drop compared to the full-precision BERT-base.
arXiv Detail & Related papers (2020-12-31T16:34:54Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- Students Need More Attention: BERT-based Attention Model for Small Data with Application to Automatic Patient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining).
(i) We introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) by distilling LESA-BERT into smaller variants, we aim to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- BERT-of-Theseus: Compressing BERT by Progressive Module Replacing [113.48041857222431]
Our approach first divides the original BERT into several modules and builds their compact substitutes.
We randomly replace the original modules with their substitutes to train the compact modules to mimic the behavior of the original modules (a minimal sketch of this replacement scheme follows the list).
arXiv Detail & Related papers (2020-02-07T17:52:16Z)
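As an illustration of the progressive module replacement described in the BERT-of-Theseus entry above, here is a minimal PyTorch sketch under stated assumptions: the class and parameter names (TheseusEncoder, replace_prob) are hypothetical, and the original method additionally schedules the replacement rate and ends with a fine-tuning stage for the compact successor, both of which this sketch omits.

```python
import random

import torch
import torch.nn as nn


class TheseusEncoder(nn.Module):
    """Sketch of progressive module replacement: during training, each large
    original module is swapped for its compact substitute with probability
    `replace_prob`, so the substitutes learn to mimic the originals in
    context. At evaluation time only the compact modules run."""

    def __init__(self, original_modules, compact_modules, replace_prob: float = 0.5):
        super().__init__()
        assert len(original_modules) == len(compact_modules)
        self.original = nn.ModuleList(original_modules)
        self.compact = nn.ModuleList(compact_modules)
        self.replace_prob = replace_prob

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        for orig, small in zip(self.original, self.compact):
            if self.training and random.random() >= self.replace_prob:
                hidden = orig(hidden)   # keep the original module this step
            else:
                hidden = small(hidden)  # substitute stands in for the original
        return hidden


# Toy usage: six "original" blocks, each shadowed by a single compact layer.
originals = [nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
             for _ in range(6)]
compacts = [nn.Linear(64, 64) for _ in range(6)]
encoder = TheseusEncoder(originals, compacts, replace_prob=0.5)
out = encoder(torch.randn(2, 16, 64))  # (batch, seq_len, hidden)
```

In this sketch, the compact modules alone form the compressed model once training ends.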