Nakdan: Professional Hebrew Diacritizer
- URL: http://arxiv.org/abs/2005.03312v1
- Date: Thu, 7 May 2020 08:15:55 GMT
- Title: Nakdan: Professional Hebrew Diacritizer
- Authors: Avi Shmidman, Shaltiel Shmidman, Moshe Koppel, Yoav Goldberg
- Abstract summary: We present a system for automatic diacritization of Hebrew text.
The system combines modern neural models with carefully curated declarative linguistic knowledge.
The system supports Modern Hebrew, Rabbinic Hebrew and Poetic Hebrew.
- Score: 43.58927359102219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a system for automatic diacritization of Hebrew text. The system
combines modern neural models with carefully curated declarative linguistic
knowledge and comprehensive manually constructed tables and dictionaries.
Besides providing state-of-the-art diacritization accuracy, the system also
supports an interface for manual editing and correction of the automatic
output, and has several features which make it particularly useful for
preparation of scientific editions of Hebrew texts. The system supports Modern
Hebrew, Rabbinic Hebrew and Poetic Hebrew. The system is freely accessible for
all use at http://nakdanpro.dicta.org.il.
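The combination of neural scoring with curated lexical resources described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical (mock candidates, a stand-in lexicon, ASCII placeholder forms), not the actual Nakdan implementation or API: a model proposes scored diacritization candidates for a word, and a curated dictionary filters them before the best candidate is chosen.

```python
# Minimal sketch of dictionary-constrained diacritization (hypothetical,
# not the Nakdan system itself). A neural model would normally produce
# the candidate scores; here they are mocked.

def choose_diacritization(candidates, lexicon):
    """Pick the highest-scoring candidate, preferring dictionary-valid forms.

    candidates: list of (diacritized_form, model_score) pairs
    lexicon: set of forms attested in curated tables/dictionaries
    """
    valid = [c for c in candidates if c[0] in lexicon]
    pool = valid if valid else candidates  # fall back to raw model output
    return max(pool, key=lambda c: c[1])[0]

# Mock candidates for one word; "shalom:A"/"shalom:B" stand in for two
# possible diacritizations of the same Hebrew form.
candidates = [("shalom:A", 0.9), ("shalom:B", 0.7)]
lexicon = {"shalom:B"}  # only form B is attested in the dictionary

print(choose_diacritization(candidates, lexicon))  # -> shalom:B
```

The dictionary acts as a hard constraint when it covers the word, while the model score still decides among attested forms; unattested words fall back to the model's top choice.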
Related papers
- MenakBERT -- Hebrew Diacriticizer [0.13654846342364307]
We present MenakBERT, a character level transformer pretrained on Hebrew text and fine-tuned to produce diacritical marks for Hebrew sentences.
We show how fine-tuning a model for diacritization transfers to a task such as part-of-speech tagging.
arXiv Detail & Related papers (2024-10-03T12:07:34Z)
- A Library for Automatic Natural Language Generation of Spanish Texts [6.102700502396687]
We present a novel system for natural language generation (NLG) of Spanish sentences from a minimum set of meaningful words.
The system is able to generate complete, coherent and correctly spelled sentences from the main word sets presented by the user.
It can be easily adapted to other languages by design and can feasibly be integrated in a wide range of digital devices.
arXiv Detail & Related papers (2024-05-27T15:44:06Z)
- Introducing DictaLM -- A Large Generative Language Model for Modern Hebrew [2.1547347528250875]
We present DictaLM, a large-scale language model tailored for Modern Hebrew.
As a commitment to promoting research and development in the Hebrew language, we release both the foundation model and the instruct-tuned model under a Creative Commons license.
arXiv Detail & Related papers (2023-09-25T22:42:09Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
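The kind of mapping such FST components encode can be sketched as a longest-match transduction over a character table. This toy sketch is not the library's actual API, and the table is a tiny illustrative fragment rather than any real romanization scheme:

```python
# Toy longest-match transducer for romanizing Perso-Arabic script
# (illustrative fragment only, not a real romanization standard).
TABLE = {
    "\u0633": "s",         # seen
    "\u0644": "l",         # lam
    "\u0627": "a",         # alef
    "\u0645": "m",         # meem
    "\u0644\u0627": "la",  # lam+alef treated as one unit (longest match)
}

def romanize(text):
    out, i = [], 0
    keys = sorted(TABLE, key=len, reverse=True)  # try longer matches first
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(TABLE[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # pass through unmapped characters
            i += 1
    return "".join(out)

print(romanize("\u0633\u0644\u0627\u0645"))  # -> slam
```

A real FST library composes such mappings declaratively and handles contextual rules; the longest-match loop above only mimics the simplest case.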
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language [3.0663766446277845]
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel.
Unlike existing Hebrew PLMs, which are trained on modern Hebrew texts that diverge substantially from Rabbinic Hebrew in terms of lexicographical, morphological, syntactic and orthographic norms, Berel is trained on Rabbinic texts.
We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs.
arXiv Detail & Related papers (2022-08-03T06:59:04Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Restoring Hebrew Diacritics Without a Dictionary [4.733760777271136]
We show that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text.
We present NAKDIMON, a two-layer character level LSTM, that performs on par with much more complicated curation-dependent systems.
arXiv Detail & Related papers (2021-05-11T17:23:29Z)
- Word Alignment by Fine-tuning Embeddings on Parallel Corpora [96.28608163701055]
Word alignment over parallel corpora has a wide variety of applications, including learning translation lexicons, cross-lingual transfer of language processing tools, and automatic evaluation or analysis of translation outputs.
Recently, other work has demonstrated that pre-trained contextualized word embeddings derived from multilingually trained language models (LMs) prove an attractive alternative, achieving competitive results on the word alignment task even in the absence of explicit training on parallel data.
In this paper, we examine methods to marry the two approaches: leveraging pre-trained LMs and fine-tuning them on parallel text with objectives designed to improve alignment quality, and proposing methods to effectively extract alignments from these fine-tuned models.
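The embedding-based alignment idea in the summary above can be sketched minimally: align each source word to the target word whose contextual embedding is most cosine-similar. The vectors below are hypothetical toy values, not output of any real LM, and the greedy argmax rule is a simplification of the extraction methods the paper studies:

```python
import math

# Toy "contextual embeddings" (hypothetical); a multilingual LM would
# supply these for each word in its sentence context.
SRC = {"hund": [0.9, 0.1], "bellt": [0.1, 0.9]}   # source sentence
TGT = {"dog": [0.8, 0.2], "barks": [0.2, 0.8]}    # target sentence

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def align(src, tgt):
    """Greedy argmax alignment: each source word picks its most similar target word."""
    return {s: max(tgt, key=lambda t: cosine(sv, tgt[t])) for s, sv in src.items()}

print(align(SRC, TGT))  # -> {'hund': 'dog', 'bellt': 'barks'}
```

Fine-tuning on parallel text, as the paper describes, would adjust the embeddings so that translation pairs land closer together before this similarity step.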
arXiv Detail & Related papers (2021-01-20T17:54:47Z)
- Breaking Writer's Block: Low-cost Fine-tuning of Natural Language Generation Models [62.997667081978825]
We describe a system that fine-tunes a natural language generation model for the problem of solving Writer's Block.
The proposed fine-tuning obtains excellent results, even with a small number of epochs and a total cost of USD 150.
arXiv Detail & Related papers (2020-12-19T11:19:11Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.