A Clustering Framework for Lexical Normalization of Roman Urdu
- URL: http://arxiv.org/abs/2004.00088v1
- Date: Tue, 31 Mar 2020 20:21:55 GMT
- Title: A Clustering Framework for Lexical Normalization of Roman Urdu
- Authors: Abdul Rafae Khan, Asim Karim, Hassan Sajjad, Faisal Kamiran, and Jia Xu
- Abstract summary: Roman Urdu is an informal form of the Urdu language written in Roman script.
It lacks standard spelling and hence poses several normalization challenges during automatic language processing.
We present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora.
- Score: 10.746384310607157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Roman Urdu is an informal form of the Urdu language written in Roman script,
which is widely used in South Asia for online textual content. It lacks
standard spelling and hence poses several normalization challenges during
automatic language processing. In this article, we present a feature-based
clustering framework for the lexical normalization of Roman Urdu corpora, which
includes a phonetic algorithm UrduPhone, a string matching component, a
feature-based similarity function, and a clustering algorithm Lex-Var.
UrduPhone encodes Roman Urdu strings to their pronunciation-based
representations. The string matching component handles character-level
variations that occur when writing Urdu using Roman script.
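The pipeline described in the abstract (phonetic encoding, string matching, a combined similarity score, then clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the consonant groups are a Soundex-style stand-in and not the UrduPhone mapping, `difflib.SequenceMatcher` stands in for the string matching component, and the greedy single-pass grouping is not the Lex-Var algorithm. The names `phonetic_key`, `similarity`, `cluster`, and the `weight` and `threshold` values are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical Soundex-style consonant groups (NOT the UrduPhone mapping).
# Letters absent from this table (vowels, h, y) are dropped from the key.
_GROUPS = {
    "b": "1", "p": "1", "f": "1", "v": "1", "w": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "x": "2",
    "s": "3", "z": "3", "j": "3",
    "d": "4", "t": "4",
    "l": "5",
    "m": "6", "n": "6",
    "r": "7",
}


def phonetic_key(word: str, length: int = 6) -> str:
    """Return a fixed-length, pronunciation-oriented key for a word."""
    word = word.lower()
    if not word:
        return ""
    key = word[0]  # keep the first letter, as Soundex does
    prev = _GROUPS.get(word[0], "")
    for ch in word[1:]:
        code = _GROUPS.get(ch, "")
        # Append a code only when it differs from the previous letter's code.
        if code and code != prev:
            key += code
        prev = code
    return key[:length].ljust(length, "0")


def similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Blend a phonetic-key match with character-level string similarity."""
    phonetic = 1.0 if phonetic_key(a) == phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return weight * phonetic + (1.0 - weight) * string


def cluster(words, threshold: float = 0.75):
    """Greedy grouping: join the most similar cluster or start a new one."""
    clusters = []
    for word in words:
        best, best_score = None, threshold
        for members in clusters:
            # The first member acts as the cluster representative.
            score = similarity(word, members[0])
            if score >= best_score:
                best, best_score = members, score
        if best is None:
            clusters.append([word])
        else:
            best.append(word)
    return clusters


if __name__ == "__main__":
    print(cluster(["zindagi", "zindagee", "zindgi", "kitab", "kitaab"]))
```

With these illustrative defaults, two words can only share a cluster if their phonetic keys match, because the string term alone contributes at most 0.5. A real system would tune the weight and threshold on annotated data.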
Related papers
- ERUPD -- English to Roman Urdu Parallel Dataset [0.0]
Roman Urdu is a Latin-script adaptation of Urdu widely used in digital communication.
This study creates a novel parallel dataset comprising 75,146 sentence pairs.
arXiv Detail & Related papers (2024-12-23T13:33:09Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- Unicode Normalization and Grapheme Parsing of Indic Languages [2.974799610163104]
Writing systems of Indic languages have orthographic syllables, also known as complex graphemes, as unique horizontal units.
Our proposed normalizer is a more efficient and effective tool than the previously used Indic normalizer.
We report the pipeline for the scripts of 7 languages in this work and develop the framework for the integration of more scripts.
arXiv Detail & Related papers (2023-05-11T14:34:08Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- CALText: Contextual Attention Localization for Offline Handwritten Text [1.066048003460524]
We present an attention based encoder-decoder model that learns to read Urdu in context.
A novel localization penalty is introduced to encourage the model to attend to only one location at a time when recognizing the next character.
We evaluate the model on both Urdu and Arabic datasets and show that contextual attention localization outperforms both simple attention and multi-directional LSTM models.
arXiv Detail & Related papers (2021-11-06T19:54:21Z)
- Context based Roman-Urdu to Urdu Script Transliteration System [0.0]
The objective of this work is to improve the context-based transliteration of Roman Urdu to Urdu script.
The algorithm converts the encoded Roman words into words in the standard Urdu script and matches them against the lexicon.
arXiv Detail & Related papers (2021-09-29T05:24:55Z)
- Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset [9.478817207385472]
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages.
The dataset includes, for each language: 1) native script Wikipedia text; 2) a romanization lexicon; and 3) full sentence parallel data in both a native script of the language and the basic Latin alphabet.
arXiv Detail & Related papers (2020-07-02T14:57:28Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)