DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew
- URL: http://arxiv.org/abs/2308.16687v2
- Date: Fri, 13 Oct 2023 13:56:28 GMT
- Title: DictaBERT: A State-of-the-Art BERT Suite for Modern Hebrew
- Authors: Shaltiel Shmidman, Avi Shmidman, Moshe Koppel
- Abstract summary: We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew.
We release three fine-tuned versions of the model, designed to perform three foundational tasks in the analysis of Hebrew texts.
- Score: 2.421705925711388
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present DictaBERT, a new state-of-the-art pre-trained BERT model for
modern Hebrew, outperforming existing models on most benchmarks. Additionally,
we release three fine-tuned versions of the model, designed to perform three
specific foundational tasks in the analysis of Hebrew texts: prefix
segmentation, morphological tagging and question answering. These fine-tuned
models allow any developer to perform prefix segmentation, morphological
tagging and question answering of a Hebrew input with a single call to a
HuggingFace model, without the need to integrate any additional libraries or
code. In this paper we describe the details of the training as well as the
results on the different benchmarks. We release the models to the community,
along with sample code demonstrating their use. We release these models as part
of our goal to help further research and development in Hebrew NLP.
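The sketch below illustrates the single-call workflow described in the abstract, loading the base model through the standard Hugging Face transformers pipeline and querying it with a masked Hebrew sentence. The hub identifier dicta-il/dictabert and the example sentence are assumptions made here for illustration; consult the released sample code for the exact names and interfaces of the fine-tuned segmentation, tagging and question-answering checkpoints.

```python
# Minimal sketch (not the paper's released sample code): query the base
# DictaBERT checkpoint through the standard transformers fill-mask pipeline.
# The hub name "dicta-il/dictabert" is an assumption for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dicta-il/dictabert")

# Hebrew sentence with the model's own mask token inserted:
# "He went to the [MASK] yesterday."
sentence = f"הוא הלך אל {fill_mask.tokenizer.mask_token} אתמול."
for candidate in fill_mask(sentence, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```

Per the abstract, the fine-tuned checkpoints likewise expose each task as a single HuggingFace model call; their exact loading interface is documented in the released sample code.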
Related papers
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Introducing DictaLM -- A Large Generative Language Model for Modern Hebrew [2.1547347528250875]
We present DictaLM, a large-scale language model tailored for Modern Hebrew.
As a commitment to promoting research and development in the Hebrew language, we release both the foundation model and the instruct-tuned model under a Creative Commons license.
arXiv Detail & Related papers (2023-09-25T22:42:09Z)
- Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z)
- Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All [8.964815786230686]
We present a new pre-trained language model (PLM) for modern Hebrew, termed AlephBERTGimmel, which employs a much larger vocabulary (128K items) than previous Hebrew PLMs.
We perform a contrastive analysis of this model against all previous Hebrew PLMs (mBERT, heBERT, AlephBERT) and assess the effects of larger vocabularies on task performance.
Our experiments show that larger vocabularies lead to fewer splits, and that reducing splits improves model performance across different tasks (see the sketch after this entry).
arXiv Detail & Related papers (2022-11-28T10:17:35Z)
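As a rough illustration of the vocabulary-size effect described above, the sketch below tokenizes a few morphologically rich Hebrew words with two checkpoints of different vocabulary sizes and compares the average number of word pieces. The hub identifiers and example words are assumptions chosen for illustration, not the exact checkpoints or data evaluated in that paper.

```python
# Rough illustration (not from the paper): larger vocabularies tend to split
# Hebrew words into fewer pieces. Hub identifiers below are assumptions.
from transformers import AutoTokenizer

checkpoints = {
    "multilingual, shared vocabulary": "bert-base-multilingual-cased",
    "Hebrew-specific, larger vocabulary": "onlplab/alephbert-base",
}
words = ["וכשהלכתי", "מהאוניברסיטה", "ולתלמידים"]  # prefix-heavy Hebrew forms

for label, name in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(name)
    splits = {w: tokenizer.tokenize(w) for w in words}
    avg = sum(len(pieces) for pieces in splits.values()) / len(words)
    print(f"{label}: {avg:.1f} pieces per word on average -> {splits}")
```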
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization [108.09419317477986]
Z-Code++ is a new pre-trained language model optimized for abstractive text summarization.
The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation.
Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum.
arXiv Detail & Related papers (2022-08-21T01:00:54Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance from the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers (a toy sketch of this construction follows this entry).
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
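To make the lattice construction concrete, here is a toy sketch written under the assumption that the text units are single characters plus dictionary-matched word spans with their sentence positions; it illustrates the general idea rather than the Lattice-BERT authors' implementation.

```python
# Toy sketch of a character/word lattice (illustration only, not Lattice-BERT):
# collect every character of the sentence plus every span matching a known
# word, so both granularities are available to the encoder as input units.
def build_lattice(sentence, word_dict, max_word_len=4):
    """Return (start, end, unit) triples for character and word nodes."""
    units = [(i, i + 1, ch) for i, ch in enumerate(sentence)]  # character nodes
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            span = sentence[i:j]
            if span in word_dict:                              # word node
                units.append((i, j, span))
    return units

# Toy example with a tiny dictionary of known words.
print(build_lattice("研究生命起源", {"研究", "研究生", "生命", "起源"}))
```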
- AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With [7.345047237652976]
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology.
While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between.
arXiv Detail & Related papers (2021-04-08T20:51:29Z)
- What the [MASK]? Making Sense of Language-Specific BERT Models [39.54532211263058]
This paper presents the current state of the art in language-specific BERT models.
Our aim is to provide an overview of the commonalities and differences between language-specific BERT models and mBERT models.
arXiv Detail & Related papers (2020-03-05T20:42:51Z)