BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks
- URL: http://arxiv.org/abs/2009.05959v1
- Date: Sun, 13 Sep 2020 09:07:14 GMT
- Title: BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks
- Authors: Tongwen Huang, Qingyun She, Junlin Zhang
- Abstract summary: We propose a novel BoostingBERT model that integrates multi-class boosting into BERT.
We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU benchmarks.
- Score: 0.5893124686141781
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As a pre-trained Transformer model, BERT (Bidirectional Encoder
Representations from Transformers) has achieved ground-breaking performance on
multiple NLP tasks. On the other hand, Boosting is a popular ensemble learning
technique which combines many base classifiers and has been demonstrated to
yield better generalization performance in many machine learning tasks. Some
works have indicated that an ensemble of BERT models can further improve
application performance. However, current ensemble approaches focus on bagging
or stacking, and little effort has been devoted to exploring boosting. In this
work, we propose a novel BoostingBERT model that integrates multi-class
boosting into BERT. Our proposed model uses the pre-trained Transformer as the
base classifier, choosing harder training sets for fine-tuning, and gains the
benefits of both pre-trained language knowledge and the boosting ensemble in NLP tasks.
We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU
benchmarks. Experimental results demonstrate that our proposed model
significantly outperforms BERT on all datasets and proves its effectiveness in
many NLP tasks. Replacing BERT-base with RoBERTa as the base classifier,
BoostingBERT achieves new state-of-the-art results on several NLP tasks. We
also use knowledge distillation within the "teacher-student" framework to
reduce the computational overhead and model storage of BoostingBERT while
preserving its performance for practical applications.
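The abstract describes the boosting procedure only at a high level. As a rough illustration, the sketch below shows how a SAMME-style multi-class boosting loop could wrap fine-tuned BERT base classifiers; the helpers `fine_tune_bert` and `predict`, and the choice of the SAMME update rule, are assumptions made for illustration rather than details taken from the paper.
```python
# A minimal sketch of SAMME-style multi-class boosting over BERT base
# classifiers, assuming hypothetical helpers `fine_tune_bert(x, y, weights)`
# (returns a fine-tuned copy of the pre-trained model) and `predict(model, x)`
# (returns a class index). The SAMME update rule is an illustrative choice;
# the paper's exact boosting variant is not specified in the abstract.
import numpy as np

def boosting_bert(train_x, train_y, num_classes, num_rounds, fine_tune_bert, predict):
    """Train an ensemble of fine-tuned base classifiers with SAMME weighting."""
    train_y = np.asarray(train_y)
    n = len(train_x)
    sample_weights = np.full(n, 1.0 / n)        # start from uniform example weights
    models, alphas = [], []

    for _ in range(num_rounds):
        # Fine-tune a fresh copy of the pre-trained Transformer on re-weighted
        # data; mis-classified (harder) examples carry more weight each round.
        model = fine_tune_bert(train_x, train_y, sample_weights)
        preds = np.array([predict(model, x) for x in train_x])
        miss = preds != train_y

        err = np.sum(sample_weights * miss) / np.sum(sample_weights)
        if err >= 1.0 - 1.0 / num_classes:      # no better than random guessing
            break
        # SAMME classifier weight generalizes AdaBoost to K classes.
        alpha = np.log((1.0 - err) / max(err, 1e-12)) + np.log(num_classes - 1.0)

        # Up-weight the examples this base classifier got wrong.
        sample_weights *= np.exp(alpha * miss)
        sample_weights /= sample_weights.sum()

        models.append(model)
        alphas.append(alpha)
    return models, alphas

def ensemble_predict(models, alphas, x, num_classes, predict):
    """Weighted vote of the boosted base classifiers."""
    votes = np.zeros(num_classes)
    for model, alpha in zip(models, alphas):
        votes[predict(model, x)] += alpha
    return int(np.argmax(votes))
```
In practice, each round would clone the same pre-trained checkpoint and fine-tune it with weighted sampling or a weighted loss, which is one plausible way to realize "choosing harder training sets for fine-tuning".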
Related papers
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask
Training [55.43088293183165]
Recent studies show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have transfer learning performance similar to that of the original PLM.
In this paper, we find that the BERT subnetworks have even more potential than these studies have shown.
We train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork.
arXiv Detail & Related papers (2022-04-24T08:42:47Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiment involves 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z) - Deploying a BERT-based Query-Title Relevance Classifier in a Production
System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has been radically improving the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z) - AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z) - Optimizing small BERTs trained for German NER [0.16058099298620418]
We investigate various training techniques of smaller BERT models and evaluate them on five public German NER tasks.
We propose two new fine-tuning techniques leading to better performance: CSE-tagging and a modified form of LCRF.
Furthermore, we introduce a new technique called WWA which reduces BERT memory usage and leads to a small increase in performance.
arXiv Detail & Related papers (2021-04-23T12:36:13Z) - Evaluation of BERT and ALBERT Sentence Embedding Performance on
Downstream NLP Tasks [4.955649816620742]
This paper explores sentence embedding models for BERT and ALBERT.
We take a modified BERT network with siamese and triplet network structures, called Sentence-BERT (SBERT), and replace BERT with ALBERT to create Sentence-ALBERT (SALBERT).
arXiv Detail & Related papers (2021-01-26T09:14:06Z) - ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z) - Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [84.64004917951547]
Fine-tuning pre-trained language models like BERT has become an effective approach in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation.
arXiv Detail & Related papers (2020-02-24T16:17:12Z) - TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for
Efficient Retrieval [11.923682816611716]
We present the TwinBERT model for effective and efficient retrieval.
It has twin-structured BERT-like encoders that represent the query and the document respectively.
It allows document embeddings to be pre-computed offline and cached in memory (a rough dual-encoder sketch follows this list).
arXiv Detail & Related papers (2020-02-14T22:44:36Z)
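The TwinBERT entry above describes a dual-encoder setup in which document embeddings are pre-computed offline and only the query is encoded at serving time. The sketch below is a rough illustration of that idea using Hugging Face `transformers`; the checkpoint name, CLS pooling, and cosine scoring are illustrative assumptions, not details taken from the TwinBERT paper.
```python
# Hypothetical dual-encoder sketch in the spirit of TwinBERT: two BERT-like
# encoders produce fixed-size query and document vectors; document vectors are
# pre-computed offline and cached, so only the query is encoded at run time.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
doc_encoder = AutoModel.from_pretrained("bert-base-uncased")   # twin encoder

def encode(encoder, texts):
    """Encode a batch of texts into CLS-token embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]           # [batch, hidden] CLS vectors

# Offline: pre-compute and cache (normalized) document embeddings.
documents = ["how to reset a password", "store opening hours"]
doc_cache = torch.nn.functional.normalize(encode(doc_encoder, documents), dim=-1)

# Online: encode only the query and score it against the cached documents.
query_vec = torch.nn.functional.normalize(encode(query_encoder, ["password reset help"]), dim=-1)
scores = query_vec @ doc_cache.T                 # cosine similarity per document
best = int(scores.argmax())
print(documents[best], float(scores[0, best]))
```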