PALI: A Language Identification Benchmark for Perso-Arabic Scripts
- URL: http://arxiv.org/abs/2304.01322v1
- Date: Mon, 3 Apr 2023 19:40:14 GMT
- Title: PALI: A Language Identification Benchmark for Perso-Arabic Scripts
- Authors: Sina Ahmadi and Milind Agarwal and Antonios Anastasopoulos
- Abstract summary: This paper sheds light on the challenges of detecting languages using Perso-Arabic scripts.
We use a set of supervised techniques to classify sentences into their languages.
We also propose a hierarchical model that targets clusters of languages that are more often confused.
- Score: 30.99179028187252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Perso-Arabic scripts are a family of scripts that are widely adopted and
used by various linguistic communities around the globe. Identifying various
languages using such scripts is crucial to language technologies and
challenging in low-resource setups. As such, this paper sheds light on the
challenges of detecting languages using Perso-Arabic scripts, especially in
bilingual communities where "unconventional" writing is practiced. To address
this, we use a set of supervised techniques to classify sentences into their
languages. Building on these, we also propose a hierarchical model that targets
clusters of languages that are more often confused by the classifiers. Our
experimental results indicate the effectiveness of our solutions.
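As a rough illustration of the two-stage idea, the sketch below first routes a sentence to a cluster of frequently confused languages with a character n-gram classifier and then disambiguates within the cluster. The training data, ISO 639-3 labels, and cluster map are placeholders, not the paper's actual setup.

```python
# Minimal sketch of a hierarchical language identifier: stage 1 predicts a cluster
# of easily confused languages, stage 2 picks the language within that cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: sentences in Perso-Arabic scripts with ISO 639-3 labels.
train_sents = ["sentence 1", "sentence 2", "sentence 3", "sentence 4"]
train_langs = ["fas", "urd", "ckb", "uig"]

# Hypothetical clusters of frequently confused languages (in practice these could be
# derived from the confusion matrix of a flat classifier).
cluster_of = {"fas": "cluster_a", "urd": "cluster_a", "ckb": "cluster_b", "uig": "cluster_b"}

def char_ngram_clf():
    return make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
        LogisticRegression(max_iter=1000),
    )

# Stage 1: predict the cluster of the sentence.
stage1 = char_ngram_clf().fit(train_sents, [cluster_of[l] for l in train_langs])

# Stage 2: a dedicated classifier inside each cluster.
stage2 = {}
for cluster in set(cluster_of.values()):
    idx = [i for i, l in enumerate(train_langs) if cluster_of[l] == cluster]
    stage2[cluster] = char_ngram_clf().fit(
        [train_sents[i] for i in idx], [train_langs[i] for i in idx]
    )

def predict_language(sentence: str) -> str:
    cluster = stage1.predict([sentence])[0]
    return stage2[cluster].predict([sentence])[0]
```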
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
Transfer performance is often hindered when a low-resource target language is written in a different script from the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts.
We propose learning script-agnostic representations using several different experimental strategies.
We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for downstream script-agnostic language identification.
arXiv Detail & Related papers (2024-06-25T19:23:42Z)
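A minimal sketch of the word-level script randomization idea, assuming stub transliteration functions; a real setup would plug in an actual transliterator for the scripts involved.

```python
# Minimal sketch of word-level script randomization as a data augmentation step
# (an illustration of the idea, not the paper's implementation).
import random

def to_devanagari(word: str) -> str:
    return word  # placeholder: a real transliterator would rewrite the word

def to_perso_arabic(word: str) -> str:
    return word  # placeholder

def to_latin(word: str) -> str:
    return word  # placeholder

TRANSLITERATORS = [to_devanagari, to_perso_arabic, to_latin]

def randomize_scripts(sentence: str, p: float = 0.5, seed: int = 0) -> str:
    """With probability p, rewrite each word in a randomly chosen script."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(TRANSLITERATORS)(w) if rng.random() < p else w
        for w in sentence.split()
    )
```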
- MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling [70.34758460372629]
We introduce a new paradigm that encodes the same information with segments of consistent size across diverse languages.
MYTE produces shorter encodings for all 99 analyzed languages.
This, in turn, improves multilingual LM performance and diminishes the perplexity gap across diverse languages.
arXiv Detail & Related papers (2024-03-15T21:21:11Z)
- Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data [0.0]
Code-switching entails mixing multiple languages and is an increasingly common phenomenon in social media texts.
Pre-trained multilingual models primarily utilize data in the native script of each language.
Using the native script for each language can therefore yield better representations of the text, owing to the pre-trained knowledge.
arXiv Detail & Related papers (2024-02-07T02:59:18Z)
- Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities [36.578851892373365]
Social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script.
Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated.
arXiv Detail & Related papers (2023-05-25T18:18:42Z)
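A minimal sketch of how such synthetic noise could be injected into clean text at a controllable level, using a few common Perso-Arabic character confusions; the paper's actual noise model may differ.

```python
# Minimal sketch of generating synthetic "unconventionally written" text by
# perturbing characters at a controllable noise level (illustrative table only).
import random

# Common Perso-Arabic confusion pairs (target-language letter -> substituted letter).
SUBSTITUTIONS = {
    "\u06cc": "\u064a",  # ی FARSI YEH    -> ي ARABIC YEH
    "\u06a9": "\u0643",  # ک KEHEH        -> ك ARABIC KAF
    "\u06af": "\u0643",  # گ GAF          -> ك (keyboard lacking GAF)
    "\u0686": "\u062c",  # چ TCHEH        -> ج JEEM
    "\u067e": "\u0628",  # پ PEH          -> ب BEH
    "\u06d5": "\u0647",  # ە AE (Kurdish) -> ه HEH
}

def add_noise(text: str, level: float, seed: int = 0) -> str:
    """Replace each substitutable character with probability `level`."""
    rng = random.Random(seed)
    return "".join(
        SUBSTITUTIONS[ch] if ch in SUBSTITUTIONS and rng.random() < level else ch
        for ch in text
    )

# Training pairs for a seq2seq normalizer: (noisy input, clean target).
clean_sentences = ["clean sentence in the standard orthography"]  # placeholder
pairs = [(add_noise(s, level=0.3), s) for s in clean_sentences]
```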
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
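For illustration, a plain mapping-table romanizer in the same spirit; this is only a sketch of the operation, not the library's FST API.

```python
# Illustrative character-level romanization of Perso-Arabic text.
ROMANIZATION = {
    "\u0627": "a",   # ا ALEF
    "\u0628": "b",   # ب BEH
    "\u067e": "p",   # پ PEH
    "\u062a": "t",   # ت TEH
    "\u0633": "s",   # س SEEN
    "\u0634": "sh",  # ش SHEEN
    "\u06a9": "k",   # ک KEHEH
    "\u06af": "g",   # گ GAF
    "\u0644": "l",   # ل LAM
    "\u0645": "m",   # م MEEM
    "\u0646": "n",   # ن NOON
    "\u0648": "w",   # و WAW
    "\u0647": "h",   # ه HEH
    "\u06cc": "y",   # ی FARSI YEH
    " ": " ",
}

def romanize(text: str) -> str:
    # Unmapped characters are passed through unchanged.
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)

print(romanize("\u0633\u0644\u0627\u0645"))  # سلام -> "slam" (short vowels are not recovered)
```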
- Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
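A minimal sketch of what graphemic normalization involves, combining Unicode NFC with a mapping of interchangeable code points; the mappings shown are common examples rather than the paper's full tables.

```python
# Minimal sketch of graphemic normalization for Perso-Arabic text.
import unicodedata

# Map interchangeable or visually identical code points to one canonical form.
CANONICAL = {
    "\u064a": "\u06cc",  # ي ARABIC YEH   -> ی FARSI YEH
    "\u0649": "\u06cc",  # ى ALEF MAKSURA -> ی FARSI YEH
    "\u0643": "\u06a9",  # ك ARABIC KAF   -> ک KEHEH
}
# Unify Arabic-Indic digits (U+0660..U+0669) with Extended Arabic-Indic digits (U+06F0..U+06F9).
for i in range(10):
    CANONICAL[chr(0x0660 + i)] = chr(0x06F0 + i)

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)  # compose combining marks, e.g. ALEF + MADDA
    return "".join(CANONICAL.get(ch, ch) for ch in text)

print(normalize("\u0643\u064a\u0641"))  # Arabic KAF/YEH forms -> کیف with KEHEH/FARSI YEH
```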
- Language Lexicons for Hindi-English Multilingual Text Processing [0.0]
Present language identification techniques presume that a document contains text in one of a fixed set of languages.
Due to the unavailability of large standard corpora for Hindi-English mixed-lingual processing tasks, we propose language lexicons.
These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary.
arXiv Detail & Related papers (2021-06-29T05:42:54Z)
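A minimal sketch of word-level tagging with such lexicons over romanized Hindi-English text; the word lists below are toy examples, and the paper learns classifiers to build its lexicons rather than using fixed lists.

```python
# Minimal sketch of lexicon-based word-level language tagging for code-mixed text.
HINDI_LEXICON = {"main", "nahi", "kya", "bahut", "accha", "hai"}   # toy romanized Hindi words
ENGLISH_LEXICON = {"the", "is", "very", "good", "movie", "what"}   # toy English words

def tag_tokens(sentence: str):
    tags = []
    for token in sentence.lower().split():
        in_hi, in_en = token in HINDI_LEXICON, token in ENGLISH_LEXICON
        if in_hi and not in_en:
            tags.append((token, "hi"))
        elif in_en and not in_hi:
            tags.append((token, "en"))
        else:
            tags.append((token, "ambiguous"))  # unseen or shared vocabulary
    return tags

print(tag_tokens("movie bahut accha hai"))
# [('movie', 'en'), ('bahut', 'hi'), ('accha', 'hi'), ('hai', 'hi')]
```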
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
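A minimal sketch of the basic vocabulary-extension step that such adaptation builds on, using Hugging Face Transformers; the new subword tokens below are hypothetical, and this is not the paper's full method.

```python
# Minimal sketch: extend a pretrained multilingual model's vocabulary so that an
# unseen script no longer maps to [UNK], then continue pretraining on target text.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical new subword units covering the unseen script, e.g. mined from
# target-language text.
new_tokens = ["ڕا", "ێک", "ۆژ"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix; new rows are randomly initialized and would then be
# trained with continued masked-language-model pretraining on target-language text.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```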
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
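A minimal sketch of singular vector canonical correlation analysis (SVCCA) between two views of the same set of languages, assuming toy matrices in place of real typological features and learned language embeddings.

```python
# Minimal sketch of SVCCA: per-view SVD reduction followed by CCA, reporting the
# mean canonical correlation between the two views (illustrative setup only).
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca(X, Y, keep=0.99, n_components=10):
    """X, Y: (n_languages, dim) views of the same languages."""
    def svd_reduce(M):
        M = M - M.mean(axis=0)
        U, s, _ = np.linalg.svd(M, full_matrices=False)
        # Keep enough singular directions to explain `keep` of the variance.
        k = int(np.searchsorted(np.cumsum(s**2) / np.sum(s**2), keep)) + 1
        return U[:, :k] * s[:k]

    Xr, Yr = svd_reduce(X), svd_reduce(Y)
    k = min(n_components, Xr.shape[1], Yr.shape[1])
    Xc, Yc = CCA(n_components=k).fit_transform(Xr, Yr)
    corrs = [np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(k)]
    return float(np.mean(corrs))  # mean canonical correlation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))       # e.g., typological feature vectors
Y = X @ rng.normal(size=(50, 32))    # correlated second view, e.g., learned embeddings
print(svcca(X, Y))
```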
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.