German's Next Language Model
- URL: http://arxiv.org/abs/2010.10906v4
- Date: Thu, 3 Dec 2020 11:02:43 GMT
- Title: German's Next Language Model
- Authors: Branden Chan, Stefan Schweter, Timo M\"oller
- Abstract summary: We present the experiments which lead to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA.
By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition tasks.
- Score: 0.8520624117635327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we present the experiments which lead to the creation of our
BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying
the input training data, model size, and the presence of Whole Word Masking
(WWM) we were able to attain SoTA performance across a set of document
classification and named entity recognition (NER) tasks for both models of base
and large size. We adopt an evaluation driven approach in training these models
and our results indicate that both adding more data and utilizing WWM improve
model performance. By benchmarking against existing German models, we show that
these models are the best German models to date. Our trained models will be
made publicly available to the research community.
Related papers
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Confidence-based Ensembles of End-to-End Speech Recognition Models [71.65982591023581]
We show that a confidence-based ensemble of 5 monolingual models outperforms a system where model selection is performed via a dedicated language identification block.
We also demonstrate that it is possible to combine base and adapted models to achieve strong results on both original and target data.
arXiv Detail & Related papers (2023-06-27T23:13:43Z) - Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z) - DUMB: A Benchmark for Smart Evaluation of Dutch Models [23.811515104842826]
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks.
Relative Error Reduction (RER) compares the DUMB performance of language models to a strong baseline.
Highest performance is achieved by DeBERTaV3 (large), XLM-R (large) and mDeBERTaV3 (base)
arXiv Detail & Related papers (2023-05-22T13:27:37Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO)
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Training dataset and dictionary sizes matter in BERT models: the case of
Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z) - Interpreting Language Models Through Knowledge Graph Extraction [42.97929497661778]
We compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.
We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements.
We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa)
arXiv Detail & Related papers (2021-11-16T15:18:01Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Comparative Study of Language Models on Cross-Domain Data with Model
Agnostic Explainability [0.0]
The study compares the state-of-the-art language models - BERT, ELECTRA and its derivatives which include RoBERTa, ALBERT and DistilBERT.
The experimental results establish new state-of-the-art for 2013 rating classification task and Financial Phrasebank sentiment detection task with 69% accuracy and 88.2% accuracy respectively.
arXiv Detail & Related papers (2020-09-09T04:31:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.