Handling Compounding in Mobile Keyboard Input
- URL: http://arxiv.org/abs/2201.06469v1
- Date: Mon, 17 Jan 2022 15:28:58 GMT
- Title: Handling Compounding in Mobile Keyboard Input
- Authors: Andreas Kabel, Keith Hall, Tom Ouyang, David Rybach, Daan van Esch, Françoise Beaufays
- Abstract summary: This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages.
Smartphone keyboards typically support features such as input decoding, corrections and predictions that all rely on language models.
We show that this method brings around 20% word error rate reduction in a variety of compounding languages.
- Score: 7.309321705635677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a framework to improve the typing experience of mobile
users in morphologically rich languages. Smartphone keyboards typically support
features such as input decoding, corrections and predictions that all rely on
language models. For latency reasons, these operations happen on device, so the
models are of limited size and cannot easily cover all the words needed by
users for their daily tasks, especially in morphologically rich languages. In
particular, the compounding nature of Germanic languages makes their vocabulary
virtually infinite. Similarly, heavily inflecting and agglutinative languages
(e.g. Slavic, Turkic or Finno-Ugric languages) tend to have much larger
vocabularies than morphologically simpler languages, such as English or
Mandarin. We propose to model such languages with automatically selected
subword units annotated with what we call binding types, allowing the decoder
to know when to bind subword units into words. We show that this method brings
around 20% word error rate reduction in a variety of compounding languages.
This is more than twice the improvement we previously obtained with a more
basic approach, also described in the paper.
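To make the binding-type idea concrete, here is a minimal sketch of how a decoder could use such annotations to join subword units into surface words. The binding labels, the tiny vocabulary, and the joining rule are illustrative assumptions, not the annotation scheme or decoder actually used in the paper.
```python
# Illustrative sketch only: the binding-type labels, vocabulary, and joining
# rule below are hypothetical, not the paper's actual annotation scheme.

# Each subword unit carries a binding type telling the decoder whether it
# attaches to the unit on its left, on its right, both, or neither.
SUBWORD_BINDING = {
    "Auto": "binds_right",   # left part of a German compound (assumed)
    "bahn": "binds_left",    # right part of a compound (assumed)
    "kreuz": "both",         # can sit in the middle of a compound (assumed)
    "die": "none",           # ordinary standalone word
}

def bind_subwords(units):
    """Join a decoded subword sequence into surface words.

    A new word is started unless the previous unit binds to the right
    or the current unit binds to the left.
    """
    words, current = [], ""
    prev_binds_right = False
    for unit in units:
        binding = SUBWORD_BINDING.get(unit, "none")
        binds_left = binding in ("binds_left", "both")
        if current and (prev_binds_right or binds_left):
            current += unit          # glue onto the word being built
        else:
            if current:
                words.append(current)
            current = unit           # start a new word
        prev_binds_right = binding in ("binds_right", "both")
    if current:
        words.append(current)
    return words

# Subword units are bound into a single compound word.
print(bind_subwords(["die", "Auto", "bahn", "kreuz"]))  # ['die', 'Autobahnkreuz']
```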
Related papers
- IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? [14.77467551053299]
Transformer-based models have revolutionized the field of natural language processing.
How robust are these models in encoding linguistic properties when faced with perturbations in the input text?
In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages.
arXiv Detail & Related papers (2024-10-03T15:50:08Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - Language Model Tokenizers Introduce Unfairness Between Languages [98.92630681729518]
We show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked.
Character-level and byte-level models also exhibit over a 4x difference in encoding length for some language pairs.
We make the case that we should train future language models using multilingually fair subword tokenizers.
arXiv Detail & Related papers (2023-05-17T14:17:57Z) - Towards Zero-shot Language Modeling [90.80124496312274]
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into a language-specific semantic space, and the projected embeddings are then fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z) - Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z) - Neural Polysynthetic Language Modelling [15.257624461339867]
In high-resource languages, a common approach is to treat morphologically distinct variants of a common root as completely independent word types.
This assumes that there are limited inflections per root and that the majority will appear in a large enough corpus.
We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages.
arXiv Detail & Related papers (2020-05-11T22:57:04Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - Language-agnostic Multilingual Modeling [23.06484126933893]
- Language-agnostic Multilingual Modeling [23.06484126933893]
We build a language-agnostic multilingual ASR system which transforms all languages to one writing system through a many-to-one transliteration transducer.
We show with four Indic languages, namely, Hindi, Bengali, Tamil and Kannada, that the language-agnostic multilingual model achieves up to 10% relative reduction in Word Error Rate (WER) over a language-dependent multilingual model.
arXiv Detail & Related papers (2020-04-20T18:57:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.