Restoring Hebrew Diacritics Without a Dictionary
- URL: http://arxiv.org/abs/2105.05209v1
- Date: Tue, 11 May 2021 17:23:29 GMT
- Title: Restoring Hebrew Diacritics Without a Dictionary
- Authors: Elazar Gershuni, Yuval Pinter
- Abstract summary: We show that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.
We present NAKDIMON, a two-layer character-level LSTM that performs on par with much more complicated curation-dependent systems.
- Score: 4.733760777271136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text. We present NAKDIMON, a two-layer character-level LSTM that performs on par with much more complicated curation-dependent systems, across a diverse array of modern Hebrew sources.
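To make the architecture concrete, here is a minimal sketch of a two-layer character-level LSTM diacritizer in PyTorch. All names, vocabulary sizes, and dimensions are illustrative assumptions, not NAKDIMON's actual configuration.

```python
# Minimal sketch of a two-layer character-level BiLSTM diacritizer.
# Vocabulary size, class inventory, and dimensions are assumptions,
# not NAKDIMON's actual hyperparameters.
import torch
import torch.nn as nn

class CharDiacritizer(nn.Module):
    def __init__(self, n_chars=100, n_diacritics=16, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        # Two stacked (bidirectional) LSTM layers over the character sequence.
        self.lstm = nn.LSTM(emb, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # One diacritic decision per input character.
        self.out = nn.Linear(2 * hidden, n_diacritics)

    def forward(self, char_ids):             # (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                   # (batch, seq_len, n_diacritics)

model = CharDiacritizer()
chars = torch.randint(0, 100, (2, 40))       # two dummy 40-character inputs
print(model(chars).shape)                    # torch.Size([2, 40, 16])
```

The point of the design is that every decision is made per character, so no lexicon or morphological analyzer is needed anywhere in the loop.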
Related papers
- MenakBERT -- Hebrew Diacriticizer [0.13654846342364307]
We present MenakBERT, a character-level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences.
We show how fine-tuning a model for diacritization transfers to a task such as part-of-speech tagging.
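As a rough sketch of this kind of setup (a character-level transformer encoder with a per-character diacritic classification head), the PyTorch fragment below is illustrative only; MenakBERT's actual pretraining, tokenizer, and classification head are not reproduced here.

```python
# Hypothetical sketch: character-level transformer encoder with a
# per-character diacritic classification head (all sizes are assumptions).
import torch
import torch.nn as nn

class CharTransformerDiacritizer(nn.Module):
    def __init__(self, n_chars=100, n_diacritics=16, d_model=256,
                 n_heads=4, n_layers=6, max_len=512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_diacritics)

    def forward(self, char_ids):                      # (batch, seq_len)
        pos = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_emb(char_ids) + self.pos_emb(pos)
        return self.head(self.encoder(x))             # per-character logits

logits = CharTransformerDiacritizer()(torch.randint(0, 100, (2, 64)))
print(logits.shape)  # torch.Size([2, 64, 16])
```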
arXiv Detail & Related papers (2024-10-03T12:07:34Z)
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
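The three template variants are straightforward to render in code. The builders below are hypothetical: the instruction wording and the transliterate stub are placeholder assumptions, not the paper's exact prompts.

```python
# Hypothetical builders for the three prompt templates:
# (1) original script, (2) Latin script, (3) both.
def transliterate(text: str) -> str:
    # Placeholder: a real system would call a romanization library here.
    return text

def prompt_original(task: str, text: str) -> str:
    return f"{task}\nText: {text}\nAnswer:"

def prompt_latin(task: str, text: str) -> str:
    return f"{task}\nText (romanized): {transliterate(text)}\nAnswer:"

def prompt_both(task: str, text: str) -> str:
    return (f"{task}\nText: {text}\n"
            f"Romanized: {transliterate(text)}\nAnswer:")

task = "Classify the sentiment of the following sentence."
sentence = "דוגמה בעברית"  # a sentence in its original (Hebrew) script
for build in (prompt_original, prompt_latin, prompt_both):
    print(build(task, sentence), "\n---")
```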
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
- HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew [12.320161893898735]
HeSum is a benchmark specifically designed for abstractive text summarization in Modern Hebrew.
HeSum consists of 10,000 article-summary pairs, written by professionals, sourced from Hebrew news websites.
Linguistic analysis confirms HeSum's high abstractness and unique morphological challenges.
arXiv Detail & Related papers (2024-06-06T09:36:14Z)
- Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language [3.0663766446277845]
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel.
Unlike existing Hebrew PLMs, which are trained on modern Hebrew texts, Berel is trained on Rabbinic texts; modern Hebrew diverges substantially from Rabbinic Hebrew in its lexicographical, morphological, syntactic and orthographic norms.
We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs.
arXiv Detail & Related papers (2022-08-03T06:59:04Z)
- Data Augmentation for Sign Language Gloss Translation [115.13684506803529]
Sign language translation (SLT) is often decomposed into video-to-gloss recognition and gloss-to-text translation.
We focus here on gloss-to-text translation, which we treat as a low-resource neural machine translation (NMT) problem.
Rule-based heuristics generate pseudo-parallel gloss-text pairs from monolingual spoken-language text; by pre-training on the resulting synthetic data, we improve translation from American Sign Language (ASL) to English and from German Sign Language (DGS) to German by up to 3.14 and 2.20 BLEU, respectively.
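The augmentation idea, deriving pseudo-glosses from ordinary text with simple rules, can be sketched as follows; the two rules shown (dropping function words, uppercasing) are generic illustrations rather than the paper's exact heuristics.

```python
# Hypothetical rule-based generator of pseudo-parallel (gloss, text) pairs
# from monolingual text; the rules below are generic illustrations.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "to", "of"}

def text_to_pseudo_gloss(sentence: str) -> str:
    tokens = sentence.lower().strip(".?!").split()
    # Rule 1: drop function words, which glosses typically omit.
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    # Rule 2: glosses are conventionally written as uppercase lemmas.
    return " ".join(t.upper() for t in content)

corpus = ["The weather is nice today.", "I want to go to the store."]
pairs = [(text_to_pseudo_gloss(s), s) for s in corpus]
for gloss, text in pairs:
    print(gloss, "->", text)
# WEATHER NICE TODAY -> The weather is nice today.
# I WANT GO STORE -> I want to go to the store.
```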
arXiv Detail & Related papers (2021-05-16T16:37:36Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
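A minimal sketch of such a model: one LSTM encoder whose final state feeds both a translation decoder and a reconstruction decoder, trained jointly. All sizes are arbitrary illustrative choices, and attention and any alignment components are omitted.

```python
# Sketch: shared LSTM encoder with two decoders, one translating the
# input and one reconstructing it; the encoder outputs double as
# contextualised word embeddings. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab=5000, tgt_vocab=5000, emb=128, hid=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hid, batch_first=True)
        self.dec_trans = nn.LSTM(emb, hid, batch_first=True)
        self.dec_recon = nn.LSTM(emb, hid, batch_first=True)
        self.out_trans = nn.Linear(hid, tgt_vocab)
        self.out_recon = nn.Linear(hid, src_vocab)

    def forward(self, src, tgt_shifted):
        enc_out, state = self.encoder(self.src_emb(src))
        h_t, _ = self.dec_trans(self.tgt_emb(tgt_shifted), state)
        h_r, _ = self.dec_recon(self.src_emb(src), state)
        # enc_out serves as contextualised embeddings for the source words.
        return self.out_trans(h_t), self.out_recon(h_r), enc_out

model = TranslateAndReconstruct()
src = torch.randint(0, 5000, (2, 10))
tgt = torch.randint(0, 5000, (2, 12))
trans_logits, recon_logits, ctx = model(src, tgt)
print(trans_logits.shape, recon_logits.shape, ctx.shape)
```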
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
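One standard way to realize such a mapping is an orthogonal Procrustes fit over a small seed dictionary, keeping the high-resource space fixed. The sketch below uses toy random data; the paper's actual anchor-based construction may differ in detail.

```python
# Sketch: map low-resource monolingual embeddings into a fixed
# high-resource space via orthogonal Procrustes over seed word pairs.
# Toy random data stands in for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim, n_pairs = 50, 2000
X = rng.normal(size=(n_pairs, dim))   # low-resource vectors of seed words
Y = rng.normal(size=(n_pairs, dim))   # their high-resource translations

# W = argmin ||XW - Y||_F over orthogonal W, solved in closed form by SVD.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

mapped = X @ W       # low-resource words, now living in the fixed
print(mapped.shape)  # high-resource ("anchor") vector space
```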
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
- Building a Hebrew Semantic Role Labeling Lexical Resource from Parallel Movie Subtitles [4.089055556130724]
We present a semantic role labeling resource for Hebrew built semi-automatically through annotation projection from English.
This corpus is derived from the multilingual OpenSubtitles dataset and includes short informal sentences.
We provide a fully annotated version of the data including morphological analysis, dependency syntax and semantic role labeling in both FrameNet and PropBank styles.
We train a neural SRL model on this Hebrew resource exploiting the pre-trained multilingual BERT transformer model, and provide the first available baseline model for Hebrew SRL as a reference point.
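The projection step itself is mechanical; the hypothetical sketch below carries SRL labels from English tokens onto Hebrew tokens through a word-alignment table (the alignment and labels are toy data).

```python
# Hypothetical sketch of annotation projection: copy SRL labels from
# aligned English tokens onto Hebrew tokens. Toy alignment and labels.
def project_srl(en_labels, alignment, n_he_tokens):
    """alignment: list of (en_index, he_index) word-alignment pairs."""
    he_labels = ["O"] * n_he_tokens
    for en_i, he_i in alignment:
        if en_labels[en_i] != "O":
            he_labels[he_i] = en_labels[en_i]
    return he_labels

en_labels = ["ARG0", "PRED", "O", "ARG1"]   # "I opened the door"
he_tokens = ["פתחתי", "את", "הדלת"]          # "opened-I ACC the-door"
alignment = [(0, 0), (1, 0), (3, 2)]         # English idx -> Hebrew idx
print(project_srl(en_labels, alignment, len(he_tokens)))
# ['PRED', 'O', 'ARG1']  (ARG0 and PRED land on the same Hebrew token)
```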
arXiv Detail & Related papers (2020-05-17T10:03:42Z)
- Nakdan: Professional Hebrew Diacritizer [43.58927359102219]
We present a system for automatic diacritization of Hebrew text.
The system combines modern neural models with carefully curated declarative linguistic knowledge.
The system supports Modern Hebrew, Rabbinic Hebrew and Poetic Hebrew.
arXiv Detail & Related papers (2020-05-07T08:15:55Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
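The transfer mechanism, a single character-level tagger trained on mixed batches from related languages so that character representations are shared, can be sketched as follows; the language inventory, tag set, and sizes are illustrative assumptions.

```python
# Sketch: one character-level BiLSTM tagger shared across languages;
# mixed-language batches let character representations transfer from
# high- to low-resource languages. All sizes are assumptions.
import torch
import torch.nn as nn

class JointCharTagger(nn.Module):
    def __init__(self, n_chars=200, n_tags=40, n_langs=4, emb=64, hid=128):
        super().__init__()
        self.chars = nn.Embedding(n_chars, emb)  # shared across languages
        self.langs = nn.Embedding(n_langs, emb)  # optional language id
        self.lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, n_tags)

    def forward(self, char_ids, lang_id):
        x = self.chars(char_ids) + self.langs(lang_id).unsqueeze(1)
        h, _ = self.lstm(x)
        return self.out(h)

model = JointCharTagger()
batch = torch.randint(0, 200, (8, 30))          # a mixed-language batch
langs = torch.tensor([0, 0, 0, 1, 1, 2, 3, 3])  # language id per example
print(model(batch, langs).shape)                # torch.Size([8, 30, 40])
```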
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.