MATra: A Multilingual Attentive Transliteration System for Indian
Scripts
- URL: http://arxiv.org/abs/2208.10801v1
- Date: Tue, 23 Aug 2022 08:14:29 GMT
- Authors: Yash Raj and Bhavesh Laddagiri
- Abstract summary: This paper presents a model that can transliterate between any pair of five languages: English, Hindi, Bengali, Kannada, and Tamil.
The model beats the state of the art for all pairs among these five languages and achieves a top-1 accuracy of 80.7%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transliteration is an NLP task in which the output word is a
similar-sounding word written in the script of another language. Such systems
have been developed for several language pairs involving English as either the
source or target language and are deployed in services such as Google Translate
and in chatbots. However, very little research has addressed transliteration
from one Indic language to another. This paper demonstrates a multilingual
transformer-based model (with some modifications) that noticeably outperforms
all existing models in this domain, including the state of the art. The model
can transliterate between any pair of five languages - English, Hindi, Bengali,
Kannada, and Tamil - and is applicable wherever script differences are a
barrier to written communication. It beats the state of the art for all pairs
among these five languages, achieving a top-1 accuracy of 80.7%, about 29.5%
higher than the best previous results. Furthermore, the model achieves 93.5%
Phonetic Accuracy (transliteration is primarily a phonetic, sound-based task).
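As a minimal sketch of how the top-1 accuracy figure quoted above is typically computed for transliteration (this is illustrative only, not the paper's code; the function name and the toy romanized examples are hypothetical):

```python
# Top-1 accuracy for transliteration: a word counts as correct when the
# model's highest-ranked candidate exactly matches the gold transliteration.

def top1_accuracy(predictions, references):
    """predictions: list of ranked candidate lists; references: gold strings."""
    correct = sum(1 for cands, gold in zip(predictions, references)
                  if cands and cands[0] == gold)
    return correct / len(references)

# Toy example with romanized forms standing in for Indic-script output.
preds = [["namaste", "namasthe"],     # top-1 candidate matches
         ["dhanyavad", "dhanyawad"],  # top-1 candidate misses
         ["shukriya", "sukriya"]]     # top-1 candidate matches
gold = ["namaste", "dhanyawad", "shukriya"]

print(top1_accuracy(preds, gold))  # 2 of 3 -> 0.666...
```

Phonetic Accuracy, by contrast, scores a prediction as correct when it is pronounced the same as the reference even if the spelling differs, which is why it is the higher of the two numbers reported.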
Related papers
- INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects [10.663878830823043]
In India, despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia.
arXiv Detail & Related papers (2026-01-15T13:40:27Z) - ParsTranslit: Truly Versatile Tajik-Farsi Transliteration [6.164342356356261]
As a digraphic language, the Persian language utilizes two written standards: Perso-Arabic in Afghanistan and Iran, and Tajik-Cyrillic in Tajikistan. Script differences prevent simple one-to-one mapping, hindering written communication and interaction between Tajikistan and its Persian-speaking "siblings". We present a new state-of-the-art sequence-to-sequence model for Tajik-Farsi transliteration trained across all available datasets.
arXiv Detail & Related papers (2025-10-08T20:33:50Z) - ILID: Native Script Language Identification for Indian Languages [0.0]
The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. We release a dataset of 250K sentences covering 23 languages, including English and all 22 official Indian languages, labeled with their language identifiers. Our models outperform state-of-the-art pre-trained transformer models on the language identification task.
arXiv Detail & Related papers (2025-07-16T01:39:32Z) - Kinship in Speech: Leveraging Linguistic Relatedness for Zero-Shot TTS in Indian Languages [6.74683227658822]
India has 1369 languages, of which 22 are official, using 13 scripts. Our work focuses on zero-shot synthesis, particularly for languages whose scripts and phonotactics come from different families. Intelligible and natural speech was generated for Sanskrit, Maharashtrian and Canara Konkani, Maithili, and Kurukh.
arXiv Detail & Related papers (2025-06-04T12:22:24Z) - Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to examine whether a large language model can correctly classify words into the correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z) - A two-stage transliteration approach to improve performance of a multilingual ASR [1.9511556030544333]
This paper presents an approach to build a language-agnostic end-to-end model trained on a grapheme set.
We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages.
arXiv Detail & Related papers (2024-10-09T05:30:33Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - TransliCo: A Contrastive Learning Framework to Address the Script Barrier in Multilingual Pretrained Language Models [50.40191599304911]
We propose TransliCo to fine-tune an mPLM by contrasting sentences in its training data and their transliterations in a unified script.
We show that Furina outperforms the original Glot500-m on various zero-shot crosslingual transfer tasks.
arXiv Detail & Related papers (2024-01-12T15:12:48Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - Investigating Lexical Sharing in Multilingual Machine Translation for
Indian Languages [8.858671209228536]
We investigate lexical sharing in multilingual machine translation from Hindi, Gujarati, Nepali into English.
We find that transliteration does not give pronounced improvements.
Our analysis suggests that our multilingual MT models trained on original scripts seem to already be robust to cross-script differences.
arXiv Detail & Related papers (2023-05-04T23:35:15Z) - Revisiting Language Encoding in Learning Multilingual Representations [70.01772581545103]
We propose a new approach called Cross-lingual Language Projection (XLP) to replace language embedding.
XLP projects the word embeddings into language-specific semantic space, and then the projected embeddings will be fed into the Transformer model.
Experiments show that XLP can freely and significantly boost the model performance on extensive multilingual benchmark datasets.
arXiv Detail & Related papers (2021-02-16T18:47:10Z) - HinFlair: pre-trained contextual string embeddings for pos tagging and
text classification in the Hindi language [0.0]
HinFlair is a language representation model (contextual string embeddings) pre-trained on a large monolingual Hindi corpus.
Results show that HinFlair outperforms previous state-of-the-art publicly available pre-trained embeddings for downstream tasks like text classification and pos tagging.
arXiv Detail & Related papers (2021-01-18T09:23:35Z) - Indic-Transformers: An Analysis of Transformer Language Models for
Indian Languages [0.8155575318208631]
Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks.
However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German.
Indian languages, on the other hand, are underrepresented in such benchmarks.
arXiv Detail & Related papers (2020-11-04T14:43:43Z) - Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)