Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive
Analysis of Hebrew BERT Models and a New One to Outperform Them All
- URL: http://arxiv.org/abs/2211.15199v2
- Date: Mon, 15 May 2023 18:16:26 GMT
- Title: Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive
Analysis of Hebrew BERT Models and a New One to Outperform Them All
- Authors: Eylon Gueta, Avi Shmidman, Shaltiel Shmidman, Cheyn Shmuel Shmidman,
Joshua Guedalia, Moshe Koppel, Dan Bareket, Amit Seker, Reut Tsarfaty
- Abstract summary: We present a new pre-trained language model (PLM) for modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than previous standard Hebrew PLMs.
We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performance.
Our experiments show that larger vocabularies lead to fewer splits, and that reducing splits is better for model performance across different tasks.
- Score: 8.964815786230686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new pre-trained language model (PLM) for modern Hebrew, termed
AlephBERTGimmel, which employs a much larger vocabulary (128K items) than
previous standard Hebrew PLMs. We perform a contrastive analysis of this model
against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the
effects of larger vocabularies on task performance. Our experiments show that
larger vocabularies lead to fewer splits, and that reducing splits is better
for model performance across different tasks. All in all, this new model
achieves a new SOTA on all available Hebrew benchmarks, including Morphological
Segmentation, POS Tagging, Full Morphological Analysis, NER, and Sentiment
Analysis. Accordingly, we advocate for PLMs that are larger not only in terms
of the number of layers or training data, but also in terms of their vocabulary. We
release the new model publicly for unrestricted use.
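To make the "fewer splits" claim concrete, the sketch below counts how many subword pieces different tokenizers produce for the same Hebrew sentence. This is a minimal illustration, not the paper's evaluation code; the Hugging Face hub identifiers are assumptions about where the checkpoints are published and may need to be adjusted.

```python
from transformers import AutoTokenizer

# Hypothetical hub identifiers; substitute whichever Hebrew checkpoints are actually available.
MODELS = {
    "mBERT (multilingual vocab)": "bert-base-multilingual-cased",
    "AlephBERT (52K vocab)": "onlplab/alephbert-base",
    "AlephBERTGimmel (128K vocab)": "dicta-il/alephbertgimmel-base",
}

sentence = "העיתונאים שאלו את ראש הממשלה על התקציב החדש"  # sample Modern Hebrew sentence

for name, model_id in MODELS.items():
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    words = sentence.split()
    pieces = tokenizer.tokenize(sentence)
    # "Fertility" = average number of subword pieces per whitespace word;
    # a larger vocabulary should push this ratio closer to 1.0 (fewer splits).
    print(f"{name}: {len(pieces)} pieces / {len(words)} words "
          f"= {len(pieces) / len(words):.2f} pieces per word")
```

A lower pieces-per-word ratio means whole words survive tokenization intact more often, which is the effect the abstract links to better downstream performance.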
Related papers
- Large Vocabulary Size Improves Large Language Models [28.83786065307658]
We investigate the relationship between subword vocabulary size and the performance of large language models (LLMs).
Experimental results show that larger vocabulary sizes lead to better performance in LLMs.
We introduce a simple method to use a new vocabulary instead of the pre-defined one (a generic sketch of attaching a new vocabulary to an existing model appears after this list).
arXiv Detail & Related papers (2024-06-24T10:27:07Z)
- mALBERT: Is a Compact Multilingual BERT Model Still Worth It? [5.2116647104135305]
PLMs enable huge breakthroughs in Natural Language Processing tasks, such as Spoken and Natural Language Understanding, classification, and Question-Answering.
We propose to focus on smaller models, such as compact models like ALBERT, which are more virtuous than these PLMs.
Considering these facts, we propose the free release of the first version of a multilingual compact ALBERT model, pre-trained using Wikipedia data.
arXiv Detail & Related papers (2024-03-27T08:25:28Z)
- DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew [2.421705925711388]
We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew.
We release three fine-tuned versions of the model, designed to perform three foundational tasks in the analysis of Hebrew texts.
arXiv Detail & Related papers (2023-08-31T12:43:18Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics (a simplified sketch of the stop-gradient idea appears after this list).
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- Towards Efficient NLP: A Standard Evaluation and A Strong Baseline [55.29756535335831]
This work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models.
Along with the benchmark, we also pre-train and release a strong baseline, ElasticBERT, whose elasticity is both static and dynamic.
arXiv Detail & Related papers (2021-10-13T21:17:15Z)
- AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your
Hebrew NLP Application With [7.345047237652976]
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology.
While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between.
arXiv Detail & Related papers (2021-04-08T20:51:29Z)
- Pretrained Language Model Embryology: The Birth of ALBERT [68.5801642674541]
We investigate the developmental process from a set of randomly initialized parameters to a totipotent language model.
Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different speeds during pretraining.
These findings suggest that the knowledge of a pretrained model varies during pretraining, and that having more pretraining steps does not necessarily provide a model with more comprehensive knowledge.
arXiv Detail & Related papers (2020-10-06T05:15:39Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora
for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
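Picking up the forward reference from the "Large Vocabulary Size Improves Large Language Models" entry: the snippet below is a minimal, generic sketch of pointing an existing model at a new, larger tokenizer, not the method proposed in that paper. The tokenizer path is a placeholder.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Generic illustration, not the paper's method: attach a new (larger) vocabulary
# to an existing masked-LM checkpoint and resize its embedding matrix to match.
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
new_tokenizer = AutoTokenizer.from_pretrained("path/to/128k-vocab-tokenizer")  # placeholder path

# Newly added embedding rows are randomly initialized, so the model must be
# further pre-trained before the enlarged vocabulary becomes useful.
model.resize_token_embeddings(len(new_tokenizer))
print(model.get_input_embeddings().weight.shape)  # (new_vocab_size, hidden_size)
```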
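And for the DeBERTaV3 entry above: the following is a simplified sketch of the stop-gradient idea behind gradient-disentangled embedding sharing, written in plain PyTorch as an illustration rather than the authors' implementation. The class name and structure are assumptions.

```python
import torch
import torch.nn as nn

class GradientDisentangledEmbedding(nn.Module):
    """Simplified sketch (not the DeBERTaV3 implementation).

    The discriminator reads the generator's embedding table only through a
    stop-gradient and adds its own small residual table, so the discriminator
    loss can no longer pull the shared embeddings against the generator's
    masked-LM objective (the "tug-of-war" the summary refers to).
    """

    def __init__(self, generator_embedding: nn.Embedding):
        super().__init__()
        self.generator_embedding = generator_embedding  # updated by the generator's loss only
        self.delta = nn.Embedding(
            generator_embedding.num_embeddings, generator_embedding.embedding_dim
        )
        nn.init.zeros_(self.delta.weight)  # start identical to the shared table

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        shared = self.generator_embedding(input_ids).detach()  # stop-gradient toward the generator
        return shared + self.delta(input_ids)  # discriminator-specific correction

# Example: wrap a hypothetical 128K-item embedding table for a discriminator.
disc_embeddings = GradientDisentangledEmbedding(nn.Embedding(128_000, 768))
```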
This list is automatically generated from the titles and abstracts of the papers listed on this site.