edATLAS: An Efficient Disambiguation Algorithm for Texting in Languages
with Abugida Scripts
- URL: http://arxiv.org/abs/2101.03916v2
- Date: Mon, 29 Mar 2021 19:07:01 GMT
- Authors: Sourav Ghosh, Sourabh Vasant Gothe, Chandramouli Sanchi, Barath Raj
Kandur Raja
- Abstract summary: Abugida refers to a phonogram writing system where each syllable is represented using a single consonant or typographic ligature.
We propose a disambiguation algorithm and showcase its usefulness in two novel input methods for languages using the abugida writing system.
We show an improvement in typing speed by 19.49%, 25.13%, and 14.89%, in Hindi, Bengali, and Thai, respectively, using Ambiguous Input.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Abugida refers to a phonogram writing system where each syllable is
represented using a single consonant or typographic ligature, along with a
default vowel or optional diacritic(s) to denote other vowels. However, texting
in these languages poses unique challenges despite the advent of devices
with soft keyboards supporting custom key layouts. The number of characters in
these languages is large enough to require characters to be spread over
multiple views in the layout. Having to switch between views many times to type
a single word hinders the natural thought process. This prevents popular usage
of native keyboard layouts. On the other hand, supporting romanized scripts
(native words transcribed using Latin characters) with language model based
suggestions is also set back by the lack of uniform romanization rules.
To this end, we propose a disambiguation algorithm and showcase its
usefulness in two novel mutually non-exclusive input methods for languages
natively using the abugida writing system: (a) disambiguation of ambiguous
input for abugida scripts, and (b) disambiguation of word variants in romanized
scripts. We benchmark these approaches using public datasets, and show an
improvement in typing speed by 19.49%, 25.13%, and 14.89%, in Hindi, Bengali,
and Thai, respectively, using Ambiguous Input, owing to the human ease of
locating keys combined with the efficiency of our inference method. Our Word
Variant Disambiguation (WDA) maps valid variants of romanized words, previously
treated as Out-of-Vocab, to a vocabulary of 100k words with high accuracy,
leading to an increase in Error Correction F1 score by 10.03% and Next Word
Prediction (NWP) by 62.50% on average.
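The Ambiguous Input idea above is analogous to T9-style predictive text: each key covers several consonants, and a key sequence is expanded against a lexicon to recover the intended word. The sketch below illustrates this with an invented Devanagari key grouping and a toy lexicon keyed by consonant skeletons (vowel signs stripped); it uses naive enumeration, whereas the paper's contribution is a more efficient inference method, which this does not reproduce.

```python
from itertools import product

# Invented grouping: each ambiguous key stands for several Devanagari
# consonants. The paper's actual layouts differ.
KEYS = {
    "1": "कखगघ",
    "2": "चछजझ",
    "3": "तथदध",
    "4": "पफबभ",
}

# Toy lexicon keyed by consonant skeleton (diacritics stripped), mapping
# to full surface forms, ordered by an assumed frequency ranking.
LEXICON = {
    "कथ": ["कथा"],        # kathā "story"
    "पत": ["पता", "पत"],   # patā "address", pat
    "जब": ["जब"],          # jab "when"
}

def candidates(key_seq: str) -> list[str]:
    """Expand an ambiguous key sequence into ranked lexicon words."""
    out = []
    # Naively enumerate every consonant skeleton the keys could denote
    # and keep those that match a lexicon entry.
    for skel in product(*(KEYS[k] for k in key_seq)):
        out.extend(LEXICON.get("".join(skel), []))
    return out

print(candidates("13"))  # ['कथा']
print(candidates("43"))  # ['पता', 'पत']
```

Note that enumeration grows exponentially in the key-sequence length; a practical implementation would prune with a trie over skeletons and rank candidates with a language model, which is closer in spirit to the efficiency claim in the abstract.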
Related papers
- Bukva: Russian Sign Language Alphabet (2024-10-11)
  This paper investigates the recognition of the Russian fingerspelling alphabet, also known as the Russian Sign Language (RSL) dactyl.
  Dactyl is a component of sign languages where distinct hand movements represent individual letters of a written language.
  We provide Bukva, the first full-fledged open-source video dataset for RSL dactyl recognition.
- Homonym Sense Disambiguation in the Georgian Language (2024-04-24)
  This research proposes a novel approach to the Word Sense Disambiguation (WSD) task in the Georgian language.
  It is based on supervised fine-tuning of a pre-trained Large Language Model (LLM) on a dataset formed by filtering the Georgian Common Crawl corpus.
- Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text (2023-06-06)
  We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
  We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
- Unicode Normalization and Grapheme Parsing of Indic Languages (2023-05-11)
  Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units.
  Our proposed normalizer is a more efficient and effective tool than the previously used Indic normalizer.
  We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech (2022-06-05)
  Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable text-to-speech systems.
  We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
  Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
- Handling Compounding in Mobile Keyboard Input (2022-01-17)
  This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages.
  Smartphone keyboards typically support features such as input decoding, corrections, and predictions that all rely on language models.
  We show that this method brings around 20% word error rate reduction in a variety of compounding languages.
- Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties (2021-04-04)
  We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
  To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
  We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
- Disentangling Homophemes in Lip Reading using Perplexity Analysis (2020-11-28)
  This paper proposes a new application for the Generative Pre-Training transformer.
  It serves as a language model to convert visual speech in the form of visemes to language in the form of words and sentences.
  The network uses the search for optimal perplexity to perform the viseme-to-word mapping.
- Phonotactic Complexity and its Trade-offs (2020-05-07)
  This simple measure, bits per phoneme, allows us to compare entropy across languages.
  We demonstrate a very strong negative correlation of -0.74 between bits per phoneme and the average length of words.
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling (2020-01-30)
  Cross-lingual models that fit the word order of the source language might fail to handle target languages.
  We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.