Optimizing small BERTs trained for German NER
- URL: http://arxiv.org/abs/2104.11559v1
- Date: Fri, 23 Apr 2021 12:36:13 GMT
- Title: Optimizing small BERTs trained for German NER
- Authors: Jochen Zöllner, Konrad Sperfeld, Christoph Wick, Roger Labahn
- Abstract summary: We investigate various training techniques of smaller BERT models and evaluate them on five public German NER tasks.
We propose two new fine-tuning techniques leading to better performance: CSE-tagging and a modified form of LCRF.
Furthermore, we introduce a new technique called WWA which reduces BERT memory usage and leads to a small increase in performance.
- Score: 0.16058099298620418
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Currently, the most widespread neural network architecture for training
language models is the so-called BERT, which has led to improvements in various NLP
tasks. In general, the larger the number of parameters in a BERT model, the
better the results obtained on these NLP tasks. Unfortunately, the memory
consumption and the training duration increase drastically with the size of
these models. In this article, we investigate various training techniques for
smaller BERT models and evaluate them on five public German NER tasks, two of
which are introduced in this article. We combine different methods
from other BERT variants like ALBERT, RoBERTa, and relative positional
encoding. In addition, we propose two new fine-tuning techniques leading to
better performance: CSE-tagging and a modified form of LCRF. Furthermore, we
introduce a new technique called WWA which reduces BERT memory usage and leads
to a small increase in performance.
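To make the fine-tuning setup concrete: LCRF here presumably refers to a linear-chain conditional random field placed on top of BERT's token-classification head, so that tag transitions are scored jointly with per-token emissions. The snippet below is a minimal, generic LCRF negative log-likelihood in PyTorch, with random emissions standing in for BERT outputs; the tag set and shapes are illustrative assumptions, and it does not reproduce the authors' modified LCRF or their CSE-tagging scheme.

```python
import torch

def lcrf_nll(emissions, transitions, tags):
    """Negative log-likelihood of a tag sequence under a linear-chain CRF.

    emissions:   (T, K) per-token tag scores, e.g. from a BERT token-classification head
    transitions: (K, K) learned score for moving from tag i to tag j
    tags:        (T,)   gold tag indices
    """
    T, K = emissions.shape
    # Score of the gold path: emission scores plus pairwise transition scores.
    gold = emissions[torch.arange(T), tags].sum()
    gold = gold + transitions[tags[:-1], tags[1:]].sum()
    # Forward algorithm for the log partition function log Z.
    alpha = emissions[0]                                            # (K,)
    for t in range(1, T):
        # alpha[j] = logsumexp_i( alpha[i] + transitions[i, j] ) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    log_z = torch.logsumexp(alpha, dim=0)
    return log_z - gold

# Toy usage: 6 sub-word tokens, 5 tags (e.g. O, B-PER, I-PER, B-LOC, I-LOC).
torch.manual_seed(0)
emissions = torch.randn(6, 5, requires_grad=True)    # stand-in for BERT outputs
transitions = torch.zeros(5, 5, requires_grad=True)  # learned jointly with BERT
tags = torch.tensor([1, 2, 0, 3, 4, 0])
loss = lcrf_nll(emissions, transitions, tags)
loss.backward()                                      # gradients flow into both tensors
```

In an actual fine-tuning run, the emissions would come from the BERT head, the transition matrix would be trained jointly, and Viterbi decoding would replace the per-token argmax at prediction time.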
Related papers
- LegalTurk Optimized BERT for Multi-Label Text Classification and NER [0.0]
We introduce a modified pre-training approach that combines diverse masking strategies.
In this work, we focus on two essential downstream tasks in the legal domain: named entity recognition and multi-label text classification.
Our modified approach demonstrated significant improvements in both NER and multi-label text classification tasks compared to the original BERT model.
arXiv Detail & Related papers (2024-06-30T10:19:54Z)
- SpikeBERT: A Language Spikformer Learned from BERT with Knowledge Distillation [31.777019330200705]
Spiking neural networks (SNNs) offer a promising avenue to implement deep neural networks in a more energy-efficient way.
We improve a recently-proposed spiking Transformer (i.e., Spikformer) to make it possible to process language tasks.
We show that the models trained with our method, named SpikeBERT, outperform state-of-the-art SNNs and even achieve comparable results to BERTs on text classification tasks for both English and Chinese.
arXiv Detail & Related papers (2023-08-29T08:41:16Z)
- BERTino: an Italian DistilBERT model [0.0]
We present BERTino, a DistilBERT model intended as the first lightweight alternative to the BERT architecture specific to the Italian language.
We evaluate BERTino on the Italian ISDT, Italian ParTUT, Italian WikiNER and multiclass classification tasks, obtaining F1 scores comparable to those obtained by a BERT_BASE model with a remarkable improvement in training and inference speed.
arXiv Detail & Related papers (2023-03-31T15:07:40Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing smaller models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for the aforementioned real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model (a generic weight-ternarization sketch follows this list).
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks [0.5893124686141781]
We propose a novel Boosting BERT model to integrate multi-class boosting into BERT.
We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU benchmarks.
arXiv Detail & Related papers (2020-09-13T09:07:14Z)
- DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference [69.93692147242284]
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
We propose a simple but effective method, DeeBERT, to accelerate BERT inference.
Experiments show that DeeBERT is able to save up to 40% inference time with minimal degradation in model quality.
arXiv Detail & Related papers (2020-04-27T17:58:05Z)
- DynaBERT: Dynamic BERT with Adaptive Width and Depth [55.18269622415814]
We propose a novel dynamic BERT model (abbreviated as DynaBERT).
It can flexibly adjust the size and latency by selecting adaptive width and depth.
It consistently outperforms existing BERT compression methods.
arXiv Detail & Related papers (2020-04-08T15:06:28Z)
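Several of the related papers above compress or accelerate BERT at inference time (TernaryBERT, MoEBERT, DeeBERT, DynaBERT). As a concrete reference point for the TernaryBERT entry, the snippet below is a minimal sketch of the classic ternary-weight approximation: threshold small entries to zero and rescale the rest by a single scalar. It is a generic, illustrative recipe under that assumption, not TernaryBERT's exact distillation-aware quantization scheme.

```python
import torch

def ternarize(weight):
    """Approximate a float weight matrix with alpha * t, where t has entries in {-1, 0, +1}.

    Classic ternary-weight-network recipe: zero out small entries
    (threshold = 0.7 * mean|W|) and rescale the remaining signs by one scalar alpha.
    """
    delta = 0.7 * weight.abs().mean()
    mask = (weight.abs() > delta).float()
    ternary = torch.sign(weight) * mask                        # entries in {-1, 0, +1}
    alpha = (weight.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * ternary

# Toy usage on a stand-in for one BERT linear layer.
torch.manual_seed(0)
w = torch.randn(768, 768)
w_q = ternarize(w)
print("unique values:", torch.unique(w_q).numel())             # at most 3
print("relative error:", ((w - w_q).norm() / w.norm()).item())
```

Each quantized matrix then needs only two bits per entry plus one scale, which is where the memory savings of ultra-low-bit BERT variants come from.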
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.