A Multitask Learning Approach for Diacritic Restoration
- URL: http://arxiv.org/abs/2006.04016v1
- Date: Sun, 7 Jun 2020 01:20:40 GMT
- Title: A Multitask Learning Approach for Diacritic Restoration
- Authors: Sawsan Alqahtani and Ajay Mishra and Mona Diab
- Abstract summary: In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings.
Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word.
We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling.
- Score: 21.288912928687186
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many languages like Arabic, diacritics are used to specify pronunciations
as well as meanings. Such diacritics are often omitted in written text,
increasing the number of possible pronunciations and meanings for a word. This
results in more ambiguous text, making computational processing of such text
more difficult. Diacritic restoration is the task of restoring missing
diacritics in written text. Most state-of-the-art diacritic restoration
models are built on character-level information, which helps them generalize
to unseen data but presumably loses useful information at the word level.
Thus, to compensate for this loss, we investigate the use of multi-task
learning to jointly optimize diacritic restoration with related NLP problems,
namely word segmentation, part-of-speech tagging, and syntactic
diacritization. We use Arabic as a case study since it has sufficient data
resources for the tasks we consider in our joint modeling. Our joint models
significantly outperform the baselines and are comparable to state-of-the-art
models that are more complex, relying on morphological analyzers and/or
substantially more data (e.g., dialectal data).
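As a rough illustration of the joint architecture the abstract describes, the sketch below uses a shared character-level encoder feeding one classification head per task. Layer sizes, label-set sizes, and the equal loss weighting are illustrative assumptions, not the paper's exact configuration; word-level labels such as POS tags are assumed to be projected onto each word's characters.

```python
# Minimal sketch: shared character-level BiLSTM encoder with per-task heads.
import torch
import torch.nn as nn

class MultitaskDiacritizer(nn.Module):
    def __init__(self, n_chars, n_labels_per_task, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        # Shared character-level BiLSTM encoder.
        self.encoder = nn.LSTM(emb_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # One linear head per task, each predicting a label per character.
        self.heads = nn.ModuleDict({
            task: nn.Linear(2 * hidden_dim, n_labels)
            for task, n_labels in n_labels_per_task.items()
        })

    def forward(self, char_ids):
        h, _ = self.encoder(self.embed(char_ids))
        return {task: head(h) for task, head in self.heads.items()}

def joint_loss(logits, targets):
    # Sum of per-task cross-entropy losses over character positions.
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    return sum(ce(out.transpose(1, 2), targets[task])
               for task, out in logits.items())

# Example task inventory matching the abstract (label counts are assumptions).
model = MultitaskDiacritizer(
    n_chars=60,
    n_labels_per_task={"diacritics": 16, "segmentation": 5,
                       "pos": 40, "syntactic_diacritics": 16},
)
```

Sharing the encoder lets the character-level model absorb word-level signal from the auxiliary tasks without requiring a morphological analyzer at inference time.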
Related papers
- Don't Touch My Diacritics [6.307256398189243]
We focus on the handling of diacritics in texts originating in many languages and scripts.
We demonstrate, through several case studies, the adverse effects of inconsistent encoding of diacritized characters and of removing diacritics altogether.
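A minimal illustration of the inconsistent-encoding problem that paper examines: the same diacritized character can be stored precomposed (NFC) or decomposed (NFD), and naive string comparison treats the two as different.

```python
# The same 'é' in two Unicode encodings; normalization makes them comparable.
import unicodedata

precomposed = "\u00e9"   # 'é' as one code point
decomposed = "e\u0301"   # 'e' plus a combining acute accent

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```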
arXiv Detail & Related papers (2024-10-31T17:03:44Z)
- Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP).
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild".
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Automatic Restoration of Diacritics for Speech Data Sets [1.81336359426598]
We explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances.
We use a pre-trained Whisper ASR model, fine-tuned on relatively small amounts of diacritized Arabic speech data, to produce rough diacritized transcripts for the speech utterances.
The proposed framework consistently improves diacritic restoration performance compared to text-only baselines.
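As a rough illustration of that first stage, one might obtain the Whisper transcripts as below; the checkpoint name is a hypothetical placeholder, not the paper's released model.

```python
# Transcribe speech with a Whisper model fine-tuned on diacritized Arabic
# to obtain rough diacritized transcripts.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="your-org/whisper-small-ar-diacritized",  # hypothetical checkpoint
)

def rough_transcript(audio_path: str) -> str:
    # The pipeline returns a dict whose "text" field holds the transcript.
    return asr(audio_path)["text"]

# The rough transcript is then combined with the undiacritized text as an
# auxiliary input to the diacritic restoration model.
```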
arXiv Detail & Related papers (2023-11-15T19:24:54Z)
- Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
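A minimal sketch of the random-masking idea behind Guided Learning: keep each input diacritic with some probability so the model learns to exploit partial hints at inference time. The diacritic set and the masking scheme are illustrative assumptions, not the exact 2SDiac recipe.

```python
import random

# Arabic short vowels, tanwin, shadda, and sukun (U+064B..U+0652).
ARABIC_DIACRITICS = set("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")

def mask_diacritics(text: str, keep_prob: float = 0.3) -> str:
    # Drop each diacritic independently with probability 1 - keep_prob;
    # sampling keep_prob per example gives the "different levels" of masking.
    return "".join(
        ch for ch in text
        if ch not in ARABIC_DIACRITICS or random.random() < keep_prob
    )
```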
arXiv Detail & Related papers (2023-06-06T10:18:17Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
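A minimal sketch of an LSTM generative LM over sub-word linguistic units (phonemes or syllables), in the spirit of the model above; the unit inventory size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, n_units: int, emb_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_units, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_units)

    def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
        # Predict a distribution over the next unit at every position.
        h, _ = self.lstm(self.embed(unit_ids))
        return self.out(h)

# Training: next-unit cross-entropy, i.e. compare logits[:, :-1] with
# unit_ids[:, 1:]. Sampling unit by unit yields babbling-like sequences.
```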
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models for distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
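A minimal sketch of a character language model of the kind compared above. An add-one-smoothed bigram model stands in for the paper's neural and character LMs; it can rank correction candidates with very little training data.

```python
import math
from collections import Counter

class CharBigramLM:
    def __init__(self, corpus: str):
        self.bigrams = Counter(zip(corpus, corpus[1:]))
        self.unigrams = Counter(corpus)
        self.vocab_size = len(set(corpus)) or 1

    def log_prob(self, word: str) -> float:
        # Add-one smoothing keeps unseen bigrams from zeroing the score.
        return sum(
            math.log((self.bigrams[(a, b)] + 1) /
                     (self.unigrams[a] + self.vocab_size))
            for a, b in zip(word, word[1:])
        )

# Usage: rank normalization candidates by LM score, retraining as users
# confirm corrections and more target-language data accumulates.
lm = CharBigramLM("the quick brown fox jumps over the lazy dog")
print(sorted(["teh", "the"], key=lm.log_prob, reverse=True)[0])  # 'the'
```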
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Improving Yorùbá Diacritic Restoration [3.301896537513352]
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics.
Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage.
All pre-trained models, datasets, and source code have been released as an open-source project to advance efforts on Yorùbá language technology.
arXiv Detail & Related papers (2020-03-23T22:07:15Z) - A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.