Language Detection Engine for Multilingual Texting on Mobile Devices
- URL: http://arxiv.org/abs/2101.03963v1
- Date: Thu, 7 Jan 2021 16:49:47 GMT
- Title: Language Detection Engine for Multilingual Texting on Mobile Devices
- Authors: Sourabh Vasant Gothe, Sourav Ghosh, Sharmila Mani, Guggilla Bhanodai,
Ankur Agarwal, Chandramouli Sanchi
- Abstract summary: More than 2 billion mobile users worldwide type in multiple languages in the soft keyboard.
On a monolingual keyboard, 38% of falsely auto-corrected words are valid in another language.
We present a fast, lightweight, and accurate Language Detection Engine (LDE) for multilingual typing.
- Score: 0.415623340386296
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: More than 2 billion mobile users worldwide type in multiple
languages in the soft keyboard. On a monolingual keyboard, 38% of falsely
auto-corrected words are valid in another language. Such errors can be avoided
by detecting the language of each typed word and then validating the word in
that language. Language detection is a well-known problem in natural language
processing. In this paper, we present a fast, lightweight, and accurate
Language Detection Engine (LDE) for multilingual typing that dynamically
adapts to the user's intended language in real time. We propose a novel
approach in which a character N-gram model and a logistic regression based
selector model are fused to identify the language. Additionally, we present a
unique method of significantly reducing inference time through a parameter
reduction technique. We also discuss various optimizations applied across LDE
to resolve ambiguity in input text among languages that share the same
character pattern. Our method achieves an average accuracy of 94.5% for
Indian languages in Latin script and 98% for European languages on
code-switched data. This model outperforms fastText by 60.39% and ML-Kit by
23.67% in F1 score for European languages. LDE is also faster on mobile
devices, with an average inference time of 25.91 microseconds.
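The paper itself includes no code, but the core idea is straightforward to
sketch: score a typed word under per-language character N-gram models, then
let a logistic regression based selector arbitrate among the per-language
scores. The Python sketch below is a minimal illustration of that fusion
under assumed names and toy data; train_char_ngrams, selector, the corpora,
and the weights are all hypothetical stand-ins, not the authors'
implementation or their parameter reduction technique.

```python
# Illustrative sketch: character n-gram scoring fused with a
# logistic-regression-style selector for word-level language detection.
# All corpora, weights, and names here are hypothetical.
import math
from collections import Counter

def train_char_ngrams(words, n=2):
    """Count character n-grams (with boundary markers) for one language and
    return add-one smoothed log-probabilities plus an unseen-gram fallback."""
    counts = Counter()
    for w in words:
        padded = f"^{w.lower()}$"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    total, vocab = sum(counts.values()), len(counts)
    model = {g: math.log((c + 1) / (total + vocab)) for g, c in counts.items()}
    return model, math.log(1 / (total + vocab))

def ngram_score(word, model, unseen, n=2):
    """Average log-likelihood of a word under one language's n-gram model."""
    padded = f"^{word.lower()}$"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return sum(model.get(g, unseen) for g in grams) / len(grams)

def selector(scores, weights, biases):
    """Selector step: linear transform of per-language n-gram scores
    followed by a softmax over languages."""
    z = [w * s + b for s, w, b in zip(scores, weights, biases)]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    return [e / sum(exps) for e in exps]

# Toy corpora (hypothetical) standing in for real training data.
corpora = {
    "en": ["hello", "world", "thanks", "great", "where"],
    "hi-Latn": ["namaste", "dhanyavad", "kahan", "accha", "theek"],
}
models = {lang: train_char_ngrams(ws) for lang, ws in corpora.items()}

word = "dhanyavad"
scores = [ngram_score(word, m, u) for m, u in models.values()]
# Selector weights/biases would be learned offline; neutral values shown here.
probs = selector(scores, weights=[1.0, 1.0], biases=[0.0, 0.0])
best_lang, best_p = max(zip(corpora, probs), key=lambda kv: kv[1])
print(f"{word!r} -> {best_lang} (p={best_p:.2f})")
```

A single softmax over per-language scores is just one plausible reading of
"logistic regression based selector"; the paper's selector may use richer
features, and the abstract's parameter reduction step is what drives the
reported 25.91-microsecond on-device inference time.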
Related papers
- A two-stage transliteration approach to improve performance of a multilingual ASR [1.9511556030544333]
This paper presents an approach to build a language-agnostic end-to-end model trained on a grapheme set.
We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages.
arXiv Detail & Related papers (2024-10-09T05:30:33Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Streaming Bilingual End-to-End ASR model using Attention over Multiple Softmax [6.386371634323785]
We propose a novel bilingual end-to-end (E2E) modeling approach, where a single neural model can recognize both languages.
The proposed model has shared encoder and prediction networks, with language-specific joint networks that are combined via a self-attention mechanism.
arXiv Detail & Related papers (2024-01-22T01:44:42Z)
- MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark [10.92793962395538]
MULTITuDE is a novel benchmarking dataset for multilingual machine-generated text detection.
It consists of 74,081 authentic and machine-generated texts in 11 languages.
We compare the performance of zero-shot (statistical and black-box) and fine-tuned detectors.
arXiv Detail & Related papers (2023-10-20T15:57:17Z)
- Reducing language context confusion for end-to-end code-switching automatic speech recognition [50.89821865949395]
We propose a language-related attention mechanism to reduce multilingual context confusion for the E2E code-switching ASR model.
By calculating the respective attention of multiple languages, our method can efficiently transfer language knowledge from rich monolingual data.
arXiv Detail & Related papers (2022-01-28T14:39:29Z)
- Handling Compounding in Mobile Keyboard Input [7.309321705635677]
This paper proposes a framework to improve the typing experience of mobile users in morphologically rich languages.
Smartphone keyboards typically support features such as input decoding, corrections and predictions that all rely on language models.
We show that this method brings around 20% word error rate reduction in a variety of compounding languages.
arXiv Detail & Related papers (2022-01-17T15:28:58Z)
- Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z)
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.