Language Detection for Transliterated Content
- URL: http://arxiv.org/abs/2401.04619v1
- Date: Tue, 9 Jan 2024 15:40:54 GMT
- Title: Language Detection for Transliterated Content
- Authors: Selva Kumar S, Afifah Khan Mohammed Ajmal Khan, Chirag Manjeshwar,
Imadh Ajaz Banday
- Abstract summary: We study the widespread use of transliteration, where the English alphabet is employed to convey messages in native languages.
This paper addresses this challenge through a dataset of phone text messages in Hindi and Russian transliterated into English.
The research pioneers innovative approaches to identify and convert transliterated text.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the contemporary digital era, the Internet functions as an unparalleled
catalyst, dismantling geographical and linguistic barriers, particularly evident
in texting. This evolution facilitates global communication, transcending
physical distances and fostering dynamic cultural exchange. A notable trend is
the widespread use of transliteration, where the English alphabet is employed
to convey messages in native languages, posing a unique challenge for language
technology in accurately detecting the source language. This paper addresses
this challenge through a dataset of phone text messages in Hindi and Russian
transliterated into English, utilizing BERT for language classification and the
Google Translate API for transliteration conversion. The research pioneers
innovative approaches to identify and convert transliterated text, navigating
challenges in the diverse linguistic landscape of digital communication.
Emphasizing the pivotal role of comprehensive datasets for training Large
Language Models (LLMs) such as BERT, our model showcases exceptional proficiency
in accurately identifying and classifying languages from transliterated text.
With a validation accuracy of 99%, our model's robust performance underscores
its reliability. The comprehensive exploration of transliteration dynamics,
supported by innovative approaches and cutting-edge technologies like BERT,
positions our research at the forefront of addressing unique challenges in the
linguistic landscape of digital communication. Beyond contributing to language
identification and transliteration capabilities, this work holds promise for
applications in content moderation, analytics, and fostering a globally
connected community engaged in meaningful dialogue.
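The two-stage pipeline the abstract describes (language identification of romanized text, then routing to a conversion backend) can be sketched in miniature. A character-bigram scorer stands in for the paper's fine-tuned BERT classifier, and the backend call is stubbed rather than invoking the Google Translate API; the sample phrases are invented for illustration, not drawn from the paper's dataset.

```python
# Toy sketch of the two-stage pipeline: (1) identify the source language of
# romanized text, (2) route it to a transliteration/translation backend.
# The real system fine-tunes BERT and calls the Google Translate API; here a
# character-bigram overlap scorer stands in for the classifier.
from collections import Counter

# Tiny illustrative "training" samples of romanized Hindi and Russian
# (assumed examples, not from the paper's dataset).
SAMPLES = {
    "hindi": ["aap kaise ho", "main theek hoon", "kya haal hai", "bahut accha"],
    "russian": ["kak dela", "ya khorosho", "privet kak ty", "ochen khorosho"],
}

def bigrams(text):
    """Character bigrams with word-boundary padding."""
    text = f" {text} "
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Build one bigram-count profile per language.
PROFILES = {
    lang: Counter(bg for s in texts for bg in bigrams(s))
    for lang, texts in SAMPLES.items()
}

def detect_language(text):
    """Score each language by bigram-count overlap; return the best match."""
    counts = Counter(bigrams(text.lower()))
    scores = {
        lang: sum(min(counts[bg], prof[bg]) for bg in counts)
        for lang, prof in PROFILES.items()
    }
    return max(scores, key=scores.get)

def convert(text):
    """Stage 2 stub: the paper uses the Google Translate API here."""
    lang = detect_language(text)
    return lang, f"[{lang} backend would convert: {text!r}]"

print(convert("kya haal hai aapka"))   # routed to the Hindi profile
print(convert("privet kak dela"))      # routed to the Russian profile
```

The bigram scorer is only a baseline-style placeholder: it illustrates why romanized Hindi and Russian are separable from surface character statistics, the signal a fine-tuned BERT encoder exploits far more robustly at scale.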
Related papers
- Improving Multilingual Neural Machine Translation by Utilizing Semantic and Linguistic Features [18.76505158652759]
We propose to exploit both semantic and linguistic features between multiple languages to enhance multilingual translation.
On the encoder side, we introduce a disentangling learning task that aligns encoder representations by disentangling semantic and linguistic features.
On the decoder side, we leverage a linguistic encoder to integrate low-level linguistic features to assist in the target language generation.
arXiv Detail & Related papers (2024-08-02T17:10:12Z) - Tamil Language Computing: the Present and the Future [0.0]
Language computing integrates linguistics, computer science, and cognitive psychology to create meaningful human-computer interactions.
Recent advancements in deep learning have made computers more accessible and capable of independent learning and adaptation.
The paper underscores the importance of building practical applications for languages like Tamil to address everyday communication needs.
arXiv Detail & Related papers (2024-07-11T15:56:02Z) - A Roadmap for Multilingual, Multimodal Domain Independent Deception Detection [2.1506382989223782]
Deception, a prevalent aspect of human communication, has undergone a significant transformation in the digital age.
Recent studies have shown the possibility of the existence of universal linguistic cues to deception across domains within the English language.
The practical task of deception detection in low-resource languages is not a well-studied problem due to the lack of labeled data.
arXiv Detail & Related papers (2024-05-07T00:38:34Z) - We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text [8.956635443376527]
We present a suite of experiments that allow us to understand the underlying challenges of language model adaptation to nonstandard text.
We do so by designing interventions that approximate several types of linguistic variation and their interactions with existing biases of language models.
Applying our interventions during language model adaptation with varying size and nature of training data, we gain important insights into when knowledge transfer can be successful.
arXiv Detail & Related papers (2024-04-10T18:56:53Z) - Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z) - Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z) - Discovering Phonetic Inventories with Crosslingual Automatic Speech
Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Detect Language of Transliterated Texts [0.0]
Informal transliteration from other languages to English is prevalent in social media threads, instant messaging, and discussion forums.
We propose a Language Identification (LID) system, with an approach for feature extraction.
We tokenize the words into phonetic syllables and use a simple Long Short-term Memory (LSTM) network architecture to detect the language of transliterated texts.
arXiv Detail & Related papers (2020-04-26T10:28:02Z) - A Study of Cross-Lingual Ability and Language-specific Information in
Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z) - Exploring the Limits of Transfer Learning with a Unified Text-to-Text
Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of introducing transfer learning techniques for NLP by a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.