Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language
- URL: http://arxiv.org/abs/2208.01875v1
- Date: Wed, 3 Aug 2022 06:59:04 GMT
- Title: Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language
- Authors: Avi Shmidman, Joshua Guedalia, Shaltiel Shmidman, Cheyn Shmuel
Shmidman, Eli Handel, Moshe Koppel
- Abstract summary: We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel.
Existing Hebrew PLMs are trained on modern Hebrew texts, which diverge substantially from Rabbinic Hebrew in their lexicographical, morphological, syntactic and orthographic norms.
We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs.
- Score: 3.0663766446277845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed
Berel (BERT Embeddings for Rabbinic-Encoded Language). Whilst other PLMs exist
for processing Hebrew texts (e.g., HeBERT, AlephBert), they are all trained on
modern Hebrew texts, which diverge substantially from Rabbinic Hebrew in terms
of their lexicographical, morphological, syntactic and orthographic norms. We
demonstrate the superiority of Berel on Rabbinic texts via a challenge set of
Hebrew homographs. We release the new model and homograph challenge set for
unrestricted use.
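Since the model and homograph challenge set are released for unrestricted use, a minimal sketch of querying such a BERT-style masked language model is shown below, assuming a Hugging Face transformers-compatible release. The model identifier, the example sentence and the mask-prediction workflow are illustrative assumptions, not details confirmed by the abstract.

```python
# A minimal sketch (not from the paper) of masked-token prediction with a
# BERT-style Rabbinic Hebrew model via Hugging Face transformers.
# The model identifier below is an assumption; consult the authors' release
# for the actual repository name.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "dicta-il/BEREL"  # assumed id, replace with the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

# An illustrative Rabbinic Hebrew fragment (opening of Avot 6:1) with the
# following token masked out.
sentence = "שנו חכמים בלשון המשנה " + tokenizer.mask_token

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Rank the vocabulary at the masked position; on a homograph challenge set one
# would compare the model's scores for the competing readings in context.
mask_positions = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

A comparison against a modern-Hebrew PLM such as AlephBert on the homograph challenge set could follow the same pattern, scoring each candidate reading of the masked position in context and checking which model prefers the correct one.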
Related papers
- MenakBERT -- Hebrew Diacriticizer [0.13654846342364307]
We present MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences.
We show how fine-tuning a model for diacritization transfers to tasks such as part-of-speech tagging.
arXiv Detail & Related papers (2024-10-03T12:07:34Z)
- Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect [52.1701152610258]
Adding a Swiss German adapter to a modular encoder achieves 97.5% of fully monolithic adaptation performance.
For the task of retrieving Swiss German sentences given Standard German queries, adapting a character-level model is more effective than the other adaptation strategies.
arXiv Detail & Related papers (2024-01-25T18:59:32Z)
- Introducing DictaLM -- A Large Generative Language Model for Modern Hebrew [2.1547347528250875]
We present DictaLM, a large-scale language model tailored for Modern Hebrew.
As a commitment to promoting research and development in the Hebrew language, we release both the foundation model and the instruct-tuned model under a Creative Commons license.
arXiv Detail & Related papers (2023-09-25T22:42:09Z)
- Restoring Hebrew Diacritics Without a Dictionary [4.733760777271136]
We show that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.
We present NAKDIMON, a two-layer character-level LSTM that performs on par with much more complicated curation-dependent systems.
arXiv Detail & Related papers (2021-05-11T17:23:29Z)
- AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With [7.345047237652976]
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology.
While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between.
arXiv Detail & Related papers (2021-04-08T20:51:29Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- Nakdan: Professional Hebrew Diacritizer [43.58927359102219]
We present a system for automatic diacritization of Hebrew text.
The system combines modern neural models with carefully curated declarative linguistic knowledge.
The system supports Modern Hebrew, Rabbinic Hebrew and Poetic Hebrew.
arXiv Detail & Related papers (2020-05-07T08:15:55Z)
- Revisiting Pre-Trained Models for Chinese Natural Language Processing [73.65780892128389]
We revisit Chinese pre-trained language models to examine their effectiveness in a non-English language.
We also propose a model called MacBERT, which improves upon RoBERTa in several ways.
arXiv Detail & Related papers (2020-04-29T02:08:30Z)
- PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation [92.7366819044397]
Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation.
This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus.
An extensive set of experiments shows that PALM achieves new state-of-the-art results on a variety of language generation benchmarks.
arXiv Detail & Related papers (2020-04-14T06:25:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.