Improving Yorùbá Diacritic Restoration
- URL: http://arxiv.org/abs/2003.10564v1
- Date: Mon, 23 Mar 2020 22:07:15 GMT
- Title: Improving Yorùbá Diacritic Restoration
- Authors: Iroro Orife, David I. Adelani, Timi Fasubaa, Victor Williamson,
Wuraola Fisayo Oyewusi, Olamilekan Wahab, Kola Tubosun
- Abstract summary: Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics.
Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage.
All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorùbá language technology.
- Score: 3.301896537513352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Yorùbá is a widely spoken West African language with a writing system
rich in orthographic and tonal diacritics. They provide morphological
information, are crucial for lexical disambiguation and pronunciation, and are
vital for any computational Speech or Natural Language Processing task.
However, diacritic marks are commonly excluded from electronic texts due to
limited device and application support as well as general education on proper
usage. We report on recent efforts at dataset cultivation. By aggregating and
improving disparate texts from the web and various personal libraries, we were
able to significantly grow our clean Yorùbá dataset from a majority-Biblical
text corpus with three sources to millions of tokens from over a
dozen sources. We evaluate updated diacritic restoration models on a new,
general-purpose, public-domain Yorùbá evaluation dataset of modern
journalistic news text, selected to be multi-purpose and reflecting
contemporary usage. All pre-trained models, datasets and source-code have been
released as an open-source project to advance efforts on Yorùbá language
technology.
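As a concrete illustration of the task described in the abstract, the sketch below shows how undiacritized/diacritized training pairs can be derived from clean Yorùbá text by stripping Unicode combining marks, and how a simple word-level restoration accuracy could be computed. This is a minimal sketch under our own assumptions; the helper names (`strip_diacritics`, `word_accuracy`) and the example phrase are illustrative and do not come from the released project.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove tonal and orthographic diacritics by decomposing to NFD
    and dropping all Unicode combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def word_accuracy(predicted: str, reference: str) -> float:
    """Fraction of whitespace-separated tokens restored exactly."""
    ref_tokens = reference.split()
    pred_tokens = predicted.split()
    if not ref_tokens:
        return 0.0
    correct = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / len(ref_tokens)


if __name__ == "__main__":
    # Hypothetical fully diacritized phrase used as the reference side of a pair.
    reference = "èdè Yorùbá"
    # Source side of the training pair: the same phrase with marks stripped.
    source = strip_diacritics(reference)
    print(source)                               # -> "ede Yoruba"
    # A restoration model's output would be scored against the reference:
    print(word_accuracy(reference, reference))  # -> 1.0
```

In practice, a restoration model would map the stripped source back to the diacritized form, and its output would be scored against held-out diacritized references in the same way.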
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP).
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild".
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z)
- KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services [5.03606775899383]
"KoMultiText" is a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform.
Our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
Our work can provide solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health.
arXiv Detail & Related papers (2023-10-06T15:19:39Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- A New Generation of Perspective API: Efficient Multilingual Character-level Transformers [66.9176610388952]
We present the fundamentals behind the next version of the Perspective API from Google Jigsaw.
At the heart of the approach is a single multilingual token-free Charformer model.
We demonstrate that by forgoing static vocabularies, we gain flexibility across a variety of settings.
arXiv Detail & Related papers (2022-02-22T20:55:31Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes, are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- A Multitask Learning Approach for Diacritic Restoration [21.288912928687186]
In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings.
Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word.
We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling.
arXiv Detail & Related papers (2020-06-07T01:20:40Z)