Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
- URL: http://arxiv.org/abs/2210.05043v2
- Date: Sat, 20 May 2023 21:47:18 GMT
- Title: Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling
- Authors: Haw-Shiuan Chang, Ruei-Yao Sun, Kathryn Ricci, Andrew McCallum
- Abstract summary: Ensembling BERT models often significantly improves accuracy, but at the cost of computation and memory footprint.
We propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction tasks.
In experiments on GLUE and SuperGLUE we show that our Multi-CLS BERT reliably improves both overall accuracy and confidence estimation.
- Score: 34.88128747535637
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensembling BERT models often significantly improves accuracy, but at the cost
of significantly more computation and memory footprint. In this work, we
propose Multi-CLS BERT, a novel ensembling method for CLS-based prediction
tasks that is almost as efficient as a single BERT model. Multi-CLS BERT uses
multiple CLS tokens with a parameterization and objective that encourages their
diversity. Thus instead of fine-tuning each BERT model in an ensemble (and
running them all at test time), we need only fine-tune our single Multi-CLS
BERT model (and run the one model at test time, ensembling just the multiple
final CLS embeddings). To test its effectiveness, we build Multi-CLS BERT on
top of a state-of-the-art pretraining method for BERT (Aroca-Ouellette and
Rudzicz, 2020). In experiments on GLUE and SuperGLUE we show that our Multi-CLS
BERT reliably improves both overall accuracy and confidence estimation. When
only 100 training samples are available in GLUE, the Multi-CLS BERT_Base model
can even outperform the corresponding BERT_Large model. We analyze the behavior
of our Multi-CLS BERT, showing that it has many of the same characteristics and
behavior as a typical BERT 5-way ensemble, but with nearly 4-times less
computation and memory.
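The mechanism described in the abstract is simple to illustrate: prepend several learnable CLS embeddings to each input, run the encoder once, and average the per-CLS logits at test time. The PyTorch snippet below is a minimal, hedged sketch of that idea; the toy encoder, dimensions, and separate per-CLS heads are assumptions, and the paper's extra re-parameterization and diversity-encouraging objective are not reproduced here.

```python
# Minimal sketch of the Multi-CLS idea: prepend K learnable "CLS" embeddings
# to every sequence, encode once, and ensemble the K final CLS vectors by
# averaging their per-head logits. Illustrative approximation only; the toy
# encoder, sizes, and separate heads are assumptions, and the paper's
# re-parameterization and diversity objective are omitted.
import torch
import torch.nn as nn


class MultiCLSClassifier(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_cls=5, num_labels=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        # K learnable CLS embeddings instead of a single [CLS] token.
        self.cls_emb = nn.Parameter(torch.randn(num_cls, hidden) * 0.02)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # One small classification head per CLS position.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, num_labels) for _ in range(num_cls)]
        )

    def forward(self, input_ids):
        batch = input_ids.size(0)
        cls = self.cls_emb.unsqueeze(0).expand(batch, -1, -1)   # (B, K, H)
        x = torch.cat([cls, self.tok_emb(input_ids)], dim=1)    # (B, K+T, H)
        h = self.encoder(x)
        # "Ensembling" happens inside one forward pass: average the K logits.
        per_cls_logits = [head(h[:, k]) for k, head in enumerate(self.heads)]
        return torch.stack(per_cls_logits, dim=0).mean(dim=0)   # (B, labels)


# Usage: a single forward pass yields the ensembled prediction.
model = MultiCLSClassifier()
fake_batch = torch.randint(0, 30522, (8, 32))  # 8 sequences of 32 token ids
print(model(fake_batch).shape)                 # torch.Size([8, 2])
```

The point of the sketch is the cost profile: the ensemble adds only K extra sequence positions and K tiny linear heads on top of one encoder pass, which is why the approach stays close to single-model cost.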
Related papers
- MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation [1.2699007098398807]
MaxPoolBERT refines the [CLS] representation by aggregating information across layers and tokens.
Our approach enhances BERT's classification accuracy without requiring pre-training or significantly increasing model size.
arXiv Detail & Related papers (2025-05-21T16:10:02Z) - Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum of 512 input tokens, which makes them non-trivial to apply in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows fine-tuning of any pretrained model to perform inference on arbitrarily long text.
arXiv Detail & Related papers (2023-10-31T15:41:08Z) - Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Finding the Winning Ticket of BERT for Binary Text Classification via Adaptive Layer Truncation before Fine-tuning [7.797987384189306]
We construct a series of BERT-based models with different sizes and compare their predictions on 8 binary classification tasks.
The results show there truly exist smaller sub-networks performing better than the full model.
arXiv Detail & Related papers (2021-11-22T02:22:47Z) - Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z) - Bertinho: Galician BERT Representations [14.341471404165349]
This paper presents a monolingual BERT model for Galician.
We release two models, built using 6 and 12 transformer layers, respectively.
We show that our models, especially the 12-layer one, outperform mBERT on most tasks.
arXiv Detail & Related papers (2021-03-25T12:51:34Z) - BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks [0.5893124686141781]
We propose a novel BoostingBERT model to integrate multi-class boosting into BERT.
We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU benchmarks.
arXiv Detail & Related papers (2020-09-13T09:07:14Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace some of the self-attention heads and directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
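DeeBERT (the last entry above) is also easy to sketch: attach a small "off-ramp" classifier after each encoder layer and stop the forward pass as soon as the current prediction is confident enough (low entropy). The snippet below is a hedged illustration in plain PyTorch; the toy encoder, the entropy threshold, and the batch-level exit rule are assumptions, not the paper's configuration (the actual method decides the exit per example).

```python
# Hedged sketch of DeeBERT-style early exiting: an "off-ramp" classifier after
# each encoder layer lets inference stop once a prediction looks confident.
# The toy encoder, sizes, threshold, and batch-level exit rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitEncoder(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_layers=4, num_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
             for _ in range(num_layers)]
        )
        # One lightweight classifier ("off-ramp") per encoder layer.
        self.off_ramps = nn.ModuleList(
            [nn.Linear(hidden, num_labels) for _ in range(num_layers)]
        )

    def forward(self, input_ids, entropy_threshold=0.3):
        h = self.emb(input_ids)
        for layer, ramp in zip(self.layers, self.off_ramps):
            h = layer(h)
            logits = ramp(h[:, 0])  # treat the first position as [CLS]
            probs = F.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
            if entropy.max() < entropy_threshold:  # whole batch is confident
                return logits                      # exit early, skip the rest
        return logits                              # fell through to the top


# Usage: lowering the threshold trades accuracy for speed.
model = EarlyExitEncoder().eval()
batch = torch.randint(0, 30522, (4, 32))
with torch.no_grad():
    print(model(batch).shape)  # torch.Size([4, 2])
```

As the entry above notes, this kind of confidence-based exit is reported to save up to 40% of inference time with minimal loss in model quality.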
This list is automatically generated from the titles and abstracts of the papers in this site.