MenakBERT -- Hebrew Diacriticizer
- URL: http://arxiv.org/abs/2410.02417v1
- Date: Thu, 3 Oct 2024 12:07:34 GMT
- Title: MenakBERT -- Hebrew Diacriticizer
- Authors: Ido Cohen, Jacob Gidron, Idan Pinto
- Abstract summary: We present MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences.
We show how fine-tuning a model for diacritization transfers to a task such as part-of-speech tagging.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diacritical marks in the Hebrew language give words their vocalized form. The task of adding diacritical marks to plain Hebrew text is still dominated by a system that relies heavily on human-curated resources. Recent models trained on diacritized Hebrew texts still show a performance gap. We use a recently developed character-based PLM to narrow this gap, presenting MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences. We further show how fine-tuning a model for diacritization transfers to a task such as part-of-speech tagging.
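As a rough illustration of the setup described above (not the authors' released code), diacritization with a character-level transformer can be framed as per-character classification: each Hebrew character receives a label for its diacritic mark. The vocabulary size, label set, and model sizes below are illustrative assumptions, and real systems typically predict several mark types per character (niqqud, dagesh, sin/shin dot) rather than a single label.

```python
import torch
import torch.nn as nn

# Minimal sketch: diacritization as per-character classification with a small
# character-level Transformer encoder. The vocabulary, label set, and sizes
# below are illustrative assumptions, not MenakBERT's actual configuration.

NUM_CHARS = 64        # assumed character vocabulary size (Hebrew letters + punctuation)
NUM_DIACRITICS = 16   # assumed number of diacritic classes (incl. "no mark")

class CharDiacritizer(nn.Module):
    def __init__(self, dim: int = 128, layers: int = 2, heads: int = 4):
        super().__init__()
        self.embed = nn.Embedding(NUM_CHARS, dim)
        self.pos = nn.Embedding(512, dim)                       # learned positions
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.head = nn.Linear(dim, NUM_DIACRITICS)              # one label per character

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.embed(char_ids) + self.pos(positions)
        return self.head(self.encoder(x))                       # (batch, seq_len, labels)

# Toy training step on random data, just to show the shapes and the loss.
model = CharDiacritizer()
chars = torch.randint(0, NUM_CHARS, (8, 40))                    # batch of character ids
labels = torch.randint(0, NUM_DIACRITICS, (8, 40))              # one diacritic per character
logits = model(chars)
loss = nn.functional.cross_entropy(logits.reshape(-1, NUM_DIACRITICS), labels.reshape(-1))
loss.backward()
```

MenakBERT itself starts from a pretrained character-level PLM and fine-tunes it; the sketch only shows the classification framing, trained from scratch on toy data.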
Related papers
- A Language Modeling Approach to Diacritic-Free Hebrew TTS [21.51896995655732]
We tackle the task of text-to-speech (TTS) in Hebrew.
Traditional Hebrew contains diacritics, which dictate how words should be pronounced.
The lack of diacritics in modern Hebrew text means readers are expected to infer the correct pronunciation themselves.
arXiv Detail & Related papers (2024-07-16T22:43:49Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasised speech consists in increasing the predicted duration of the emphasised word (a minimal sketch of this duration-scaling idea appears after this list).
We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and correct testers' identification of the emphasised word in a sentence by 40% on a reference female en-US voice.
arXiv Detail & Related papers (2023-07-13T21:06:23Z)
- Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language [3.0663766446277845]
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel.
Unlike existing Hebrew PLMs, which are trained on modern Hebrew texts that diverge substantially from Rabbinic Hebrew in their lexicographical, morphological, syntactic and orthographic norms, Berel is trained on Rabbinic texts.
We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs.
arXiv Detail & Related papers (2022-08-03T06:59:04Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters; a simplified sketch of this module appears after this list.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Restoring Hebrew Diacritics Without a Dictionary [4.733760777271136]
We show that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.
We present NAKDIMON, a two-layer character-level LSTM that performs on par with much more complicated curation-dependent systems.
arXiv Detail & Related papers (2021-05-11T17:23:29Z)
- AlephBERT: A Hebrew Large Pre-Trained Language Model to Start-off your Hebrew NLP Application With [7.345047237652976]
Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology.
While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between.
arXiv Detail & Related papers (2021-04-08T20:51:29Z)
- Nakdan: Professional Hebrew Diacritizer [43.58927359102219]
We present a system for automatic diacritization of Hebrew text.
The system combines modern neural models with carefully curated declarative linguistic knowledge.
The system supports Modern Hebrew, Rabbinic Hebrew and Poetic Hebrew.
arXiv Detail & Related papers (2020-05-07T08:15:55Z)
- PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation [92.7366819044397]
Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation.
This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus.
An extensive set of experiments show that PALM achieves new state-of-the-art results on a variety of language generation benchmarks.
arXiv Detail & Related papers (2020-04-14T06:25:36Z)
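Two of the mechanisms summarized above lend themselves to short, self-contained sketches. Both are hedged illustrations under stated assumptions, not code from the respective papers.

First, the duration-scaling idea from the "Controllable Emphasis" entry: increase the predicted duration of the emphasised word before synthesis. The data layout and the 1.25 scale factor below are assumptions for illustration, not values from the paper.

```python
import numpy as np

def emphasize_durations(durations, word_spans, target_word, scale=1.25):
    """Scale the predicted per-phoneme durations of one word.

    durations: predicted frame counts, one per phoneme.
    word_spans: (start, end) phoneme-index spans, one per word.
    target_word: index of the word to emphasise.
    scale: assumed lengthening factor (not a value from the paper).
    """
    durations = np.asarray(durations, dtype=float).copy()
    start, end = word_spans[target_word]
    durations[start:end] *= scale
    return np.round(durations).astype(int)

# Example: emphasise the second word of a three-word utterance.
print(emphasize_durations([5, 6, 4, 7, 8, 3], [(0, 2), (2, 4), (4, 6)], target_word=1))
```

Second, the GBST module from the Charformer entry: characters are mean-pooled at several block sizes, each candidate is scored, and a per-position softmax over block sizes mixes the candidates. The real module also handles block offsets and downsamples the sequence before the Transformer; those parts are omitted in this simplified sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedGBST(nn.Module):
    """GBST-style block: mix block-pooled character representations with a
    learned, per-position softmax over block sizes (simplified)."""

    def __init__(self, dim: int, max_block_size: int = 4):
        super().__init__()
        self.max_block_size = max_block_size
        self.score = nn.Linear(dim, 1)  # shared scorer across block sizes

    def forward(self, char_embeddings: torch.Tensor) -> torch.Tensor:
        # char_embeddings: (batch, seq_len, dim)
        batch, seq_len, dim = char_embeddings.shape
        x = char_embeddings.transpose(1, 2)                  # (batch, dim, seq_len)
        candidates, scores = [], []
        for block in range(1, self.max_block_size + 1):
            pad = (block - seq_len % block) % block          # make length divisible
            pooled = F.avg_pool1d(F.pad(x, (0, pad)), kernel_size=block, stride=block)
            pooled = pooled.repeat_interleave(block, dim=2)[:, :, :seq_len]
            pooled = pooled.transpose(1, 2)                  # back to (batch, seq_len, dim)
            candidates.append(pooled)
            scores.append(self.score(pooled))                # (batch, seq_len, 1)
        cands = torch.stack(candidates, dim=2)               # (batch, seq_len, blocks, dim)
        weights = torch.softmax(torch.cat(scores, dim=2), dim=2)
        return (cands * weights.unsqueeze(-1)).sum(dim=2)    # (batch, seq_len, dim)

# Usage: mix candidate block sizes over a batch of character embeddings.
gbst = SimplifiedGBST(dim=64)
print(gbst(torch.randn(2, 37, 64)).shape)  # torch.Size([2, 37, 64])
```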