Context based Roman-Urdu to Urdu Script Transliteration System
- URL: http://arxiv.org/abs/2109.14197v1
- Date: Wed, 29 Sep 2021 05:24:55 GMT
- Title: Context based Roman-Urdu to Urdu Script Transliteration System
- Authors: H Muhammad Shakeel, Rashid Khan, Muhammad Waheed
- Abstract summary: The objective of this work is to improve the context-based transliteration of Roman-Urdu to Urdu script.
The algorithm converts encoded Roman words into words in the standard Urdu script and matches them against a lexicon.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, computers are essential to human beings and are useful in many
fields such as search engines, text processing, short messaging services, voice
chatting, and text recognition. Over the years, many tools and techniques have
been developed to support the writing of language scripts. Many Asian languages,
such as Arabic, Urdu, Persian, Chinese, and Korean, are commonly written in
Roman alphabets; Roman alphabets are the most widely used means of
transliterating languages that have non-Latin scripts. For entering Urdu
characters, many keyboard layouts already exist, yet most Urdu speakers prefer
Roman-Urdu for different applications because most users are not familiar with
the Urdu keyboard layout. The objective of this work is to improve the
context-based transliteration of Roman-Urdu to Urdu script. In this paper, we
propose an algorithm that effectively solves the transliteration issues. The
algorithm works as follows: it converts the encoded Roman words into words in
the standard Urdu script and matches them against a lexicon. If a match is
found, the word is displayed in the text editor. If more than one match is
found in the lexicon, the highest-frequency word is displayed. If no match is
found, the first encoded and converted instance is displayed and set as the
default, and the ambiguous word is then adjusted to its desired position
according to its context. The outcome of this algorithm demonstrates its
efficiency and significance compared to other models and algorithms for
context-based transliteration of Roman-Urdu to Urdu.
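The lookup-and-disambiguate procedure the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the lexicon entries, frequencies, and fallback character map are invented for the example, and the final context-based adjustment step is omitted.

```python
# Sketch of the lexicon step: convert a Roman-Urdu token to Urdu script
# via a frequency-ranked lexicon, falling back to a naive character
# mapping when no match is found. All data below is illustrative.

ROMAN_TO_URDU_LEXICON = {
    # roman word -> list of (urdu_word, corpus_frequency)
    "kitab": [("کتاب", 120)],
    "mein":  [("میں", 900), ("مَیں", 40)],  # ambiguous token
}

# Last-resort per-character map (illustrative, not a real scheme).
CHAR_MAP = {"k": "ک", "i": "ی", "t": "ت", "a": "ا", "b": "ب",
            "m": "م", "e": "ے", "n": "ن"}

def transliterate_token(roman: str) -> str:
    """Return the Urdu-script form of one Roman-Urdu token."""
    candidates = ROMAN_TO_URDU_LEXICON.get(roman.lower())
    if candidates:
        # More than one lexicon match: display the highest-frequency
        # word, as the abstract specifies.
        return max(candidates, key=lambda c: c[1])[0]
    # No match: emit the first rule-based conversion as the default.
    return "".join(CHAR_MAP.get(ch, ch) for ch in roman.lower())

print(transliterate_token("mein"))   # highest-frequency candidate
print(transliterate_token("kitab"))  # unique lexicon match
```

In the full algorithm, a token resolved through the fallback path would afterwards be repositioned or corrected using its surrounding context; that disambiguation stage is not shown here.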
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts.
We propose learning script-agnostic representations using several different experimental strategies.
We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification.
arXiv Detail & Related papers (2024-06-25T19:23:42Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap throughout diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z) - Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - CALText: Contextual Attention Localization for Offline Handwritten Text [1.066048003460524]
We present an attention based encoder-decoder model that learns to read Urdu in context.
A novel localization penalty is introduced to encourage the model to attend only one location at a time when recognizing the next character.
We evaluate the model on both Urdu and Arabic datasets and show that contextual attention localization outperforms both simple attention and multi-directional LSTM models.
arXiv Detail & Related papers (2021-11-06T19:54:21Z) - Co-occurrences using Fasttext embeddings for word similarity tasks in Urdu [0.0]
This paper builds a corpus for Urdu by scraping and integrating data from various sources.
We modify fasttext embeddings and N-Grams models to enable training them on our built corpus.
We have used these trained embeddings for a word similarity task and compared the results with existing techniques.
arXiv Detail & Related papers (2021-02-22T12:56:26Z) - Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset [9.478817207385472]
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet.
arXiv Detail & Related papers (2020-07-02T14:57:28Z) - A Clustering Framework for Lexical Normalization of Roman Urdu [10.746384310607157]
Roman Urdu is an informal form of the Urdu language written in Roman script.
It lacks standard spelling and hence poses several normalization challenges during automatic language processing.
We present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora.
arXiv Detail & Related papers (2020-03-31T20:21:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.