Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
- URL: http://arxiv.org/abs/2210.05043v2
- Date: Sat, 20 May 2023 21:47:18 GMT
- Title: Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
- Authors: Haw-Shiuan Chang, Ruei-Yao Sun, Kathryn Ricci, Andrew McCallum
- Abstract summary: Ensembling BERT models often significantly improves accuracy, but at the cost of computation and memory footprint.
We propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks.
In experiments on GLUE and SuperGLUE we show that our Multi-CLS BERT reliably improves both overall accuracy and confidence estimation.
- Score: 34.88128747535637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensembling BERT models often significantly improves accuracy, but at the cost
of significantly more computation and memory footprint. In this work, we
propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction
tasks that is almost as efficient as a single BERT model. Multi-CLS BERT uses
multiple CLS tokens with a parameterization and objective that encourages their
diversity. Thus instead of fine-tuning each BERT model in an ensemble (and
running them all at test time), we need only fine-tune our single Multi-CLS
BERT model (and run the one model at test time, ensembling just the multiple
final CLS embeddings). To test its effectiveness, we build Multi-CLS BERT on
top of a state-of-the-art pretraining method for BERT (Aroca-Ouellette and
Rudzicz, 2020). In experiments on GLUE and SuperGLUE we show that our Multi-CLS
BERT reliably improves both overall accuracy and confidence estimation. When
only 100 training samples are available in GLUE, the Multi-CLS BERT_Base model
can even outperform the corresponding BERT_Large model. We analyze the behavior
of our Multi-CLS BERT, showing that it has many of the same characteristics and
behavior as a typical BERT 5-way ensemble, but with nearly 4-times less
computation and memory.
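The mechanism described in the abstract is simple to illustrate: prepend several learnable CLS embeddings to each input, run the encoder once, and average the per-CLS logits at test time. The PyTorch snippet below is a minimal, hedged sketch of that idea; the toy encoder, dimensions, and separate per-CLS heads are assumptions, and the paper's extra re-parameterization and diversity-encouraging objective are not reproduced here.

```python
# Minimal sketch of the Multi-CLS idea: prepend K learnable "CLS" embeddings
# to every sequence, encode once, and ensemble the K final CLS vectors by
# averaging their per-head logits. Illustrative approximation only; the toy
# encoder, sizes, and separate heads are assumptions, and the paper's
# re-parameterization and diversity objective are omitted.
import torch
import torch.nn as nn


class MultiCLSClassifier(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_cls=5, num_labels=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # K learnable CLS embeddings instead of a single [CLS] token.
        self.cls_emb = nn.Parameter(torch.randn(num_cls, hidden) * 0.02)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # One small classification head per CLS position.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, num_labels) for _ in range(num_cls)]
        )

    def forward(self, input_ids):
        batch = input_ids.size(0)
        cls = self.cls_emb.unsqueeze(0).expand(batch, -1, -1)   # (B, K, H)
        x = torch.cat([cls, self.tok_emb(input_ids)], dim=1)    # (B, K+T, H)
        h = self.encoder(x)
        # "Ensembling" happens inside one forward pass: average the K logits.
        per_cls_logits = [head(h[:, k]) for k, head in enumerate(self.heads)]
        return torch.stack(per_cls_logits, dim=0).mean(dim=0)   # (B, labels)


# Usage: a single forward pass yields the ensembled prediction.
model = MultiCLSClassifier()
fake_batch = torch.randint(0, 30522, (8, 32))  # 8 sequences of 32 token ids
print(model(fake_batch).shape)                 # torch.Size([8, 2])
```

The point of the sketch is the cost profile: the ensemble adds only K extra sequence positions and K tiny linear heads on top of one encoder pass, which is why the approach stays close to single-model cost.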
Related papers
- MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation [1.2699007098398807]
MaxPoolBERT refines the [CLS] representation by aggregating information across layers and tokens.
Our approach enhances BERT's classification accuracy without requiring pre-training or significantly increasing model size.
arXiv Detail & Related papers (2025-05-21T16:10:02Z) - Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum of 512 input tokens, which makes them non-trivial to apply in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows fine-tuning of any pretrained model to perform inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z) - Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Finding the Winning Ticket of BERT for Binary Text Classification via Adaptive Layer Truncation before Fine-tuning [7.797987384189306]
We construct a series of BERT-based models with different sizes and compare their predictions on 8 binary classification tasks.
The results show there truly exist smaller sub-networks performing better than the full model.
arXiv Detail & Related papers (2021-11-22T02:22:47Z) - Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z) - Bertinho: Galician BERT Representations [14.341471404165349]
This paper presents a monolingual BERT model for Galician.
We release two models, built using 6 and 12 transformer layers, respectively.
We show that our models, especially the 12-layer one, outperform mBERT on most tasks.
arXiv Detail & Related papers (2021-03-25T12:51:34Z) - BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks [0.5893124686141781]
We propose a novel BoostingBERT model to integrate multi-class boosting into BERT.
We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU benchmarks.
arXiv Detail & Related papers (2020-09-13T09:07:14Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace some of the self-attention heads and directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
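DeeBERT (the last entry above) is also easy to sketch: attach a small "off-ramp" classifier after each encoder layer and stop the forward pass as soon as the current prediction is confident enough (low entropy). The snippet below is a hedged illustration in plain PyTorch; the toy encoder, the entropy threshold, and the batch-level exit rule are assumptions, not the paper's configuration (the actual method decides the exit per example).

```python
# Hedged sketch of DeeBERT-style early exiting: an "off-ramp" classifier after
# each encoder layer lets inference stop once a prediction looks confident.
# The toy encoder, sizes, threshold, and batch-level exit rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_layers=4, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        # One lightweight classifier ("off-ramp") per encoder layer.
        self.off_ramps = nn.ModuleList(
            [nn.Linear(hidden, num_labels) for _ in range(num_layers)]
        )

    def forward(self, input_ids, entropy_threshold=0.3):
        h = self.emb(input_ids)
        for layer, ramp in zip(self.layers, self.off_ramps):
            h = layer(h)
            logits = ramp(h[:, 0])  # treat the first position as [CLS]
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
            if entropy.max() < entropy_threshold:  # whole batch is confident
                return logits                      # exit early, skip the rest
        return logits                              # fell through to the top


# Usage: lowering the threshold trades accuracy for speed.
model = EarlyExitEncoder().eval()
batch = torch.randint(0, 30522, (4, 32))
with torch.no_grad():
    print(model(batch).shape)  # torch.Size([4, 2])
```

As the entry above notes, this kind of confidence-based exit is reported to save up to 40% of inference time with minimal loss in model quality.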
This list is automatically generated from the titles and abstracts of the papers in this site.