Script Normalization for Unconventional Writing of Under-Resourced
Languages in Bilingual Communities
- URL: http://arxiv.org/abs/2305.16407v1
- Date: Thu, 25 May 2023 18:18:42 GMT
- Title: Script Normalization for Unconventional Writing of Under-Resourced
Languages in Bilingual Communities
- Authors: Sina Ahmadi and Antonios Anastasopoulos
- Abstract summary: Social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script.
Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated.
- Score: 36.578851892373365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The wide accessibility of social media has provided linguistically
under-represented communities with an extraordinary opportunity to create
content in their native languages. This, however, comes with certain challenges
in script normalization, particularly where the speakers of a language in a
bilingual community rely on another script or orthography to write their native
language. This paper addresses the problem of script normalization for several
such languages that are mainly written in a Perso-Arabic script. Using
synthetic data with various levels of noise and a transformer-based model, we
demonstrate that the problem can be effectively remediated. We also conduct a
small-scale evaluation on real data. Our experiments indicate that script
normalization also improves the performance of downstream tasks such as
machine translation and language identification.
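As a rough illustration of the recipe the abstract describes (inject noise into clean text to build synthetic pairs, then train a sequence-to-sequence model to undo it), the sketch below adds character-level noise to Perso-Arabic text with a small confusion map. The confusion map, noise levels, and corpus placeholder are assumptions for illustration, not the paper's actual setup.

```python
import random

# Illustrative confusions between visually similar Perso-Arabic code points
# (Arabic yeh vs. Farsi yeh, Arabic kaf vs. keheh, teh marbuta vs. heh);
# the paper's real noise model is not specified in the abstract.
CONFUSIONS = {
    "\u064a": "\u06cc",  # ي -> ی
    "\u0643": "\u06a9",  # ك -> ک
    "\u0629": "\u0647",  # ة -> ه
}

def add_noise(text: str, noise_level: float = 0.3) -> str:
    """Swap characters for confusable counterparts with probability noise_level
    to simulate unconventional writing of the same sentence."""
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and random.random() < noise_level else ch
        for ch in text
    )

# Synthetic (noisy, clean) training pairs at several noise levels; these can be
# fed to any encoder-decoder transformer as a "translation" task from noisy
# input to the normalized script.
clean_sentences = ["..."]  # replace with clean text in the conventional orthography
pairs = [(add_noise(s, p), s) for s in clean_sentences for p in (0.1, 0.3, 0.5)]
```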
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to examine whether a large language model can correctly classify words into the correct categories.
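The summary above does not give the prompt or the label inventory, so both are assumed below; a minimal word-level language-identification call to GPT-3.5 Turbo via the OpenAI Python SDK might look like this:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical label set for a code-mixed Dravidian LI task; the shared task's
# actual tag inventory may differ.
LABELS = ["Tamil", "English", "Mixed", "Other"]

def identify_word(word: str, sentence: str) -> str:
    """Ask the model to assign one label to a single word in context."""
    prompt = (
        f"Sentence: {sentence}\n"
        f"Word: {word}\n"
        f"Classify the word into exactly one of {LABELS}. Answer with the label only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```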
arXiv Detail & Related papers (2024-11-06T16:20:37Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data [0.0]
Code-switching entails mixing multiple languages and is an increasingly common phenomenon in social media texts.
Pre-trained multilingual models primarily utilize the data in the native script of the language.
Using the native script for each language can generate better representations of the text owing to the pre-trained knowledge.
arXiv Detail & Related papers (2024-02-07T02:59:18Z)
- Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching [9.435669487585917]
We focus on Maltese, a Semitic language with substantial influences from Arabic, Italian, and English, which is notably written in Latin script.
We present a novel dataset annotated with word-level etymology.
We show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.
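As a toy sketch of the conditional-transliteration idea (transliterate only the words whose etymology is Arabic, keep the rest in Latin script), the lexicon and character mapping below are illustrative stand-ins; the paper's word-level etymology annotations and transliteration pipeline are not reproduced here.

```python
# Hypothetical etymology lexicon and Arabic-script forms for a few Maltese words;
# in the paper this information comes from word-level etymology annotations.
ETYMOLOGY = {"kelb": "arabic", "triq": "arabic", "skola": "italian", "computer": "english"}
ARABIC_FORMS = {"kelb": "كلب", "triq": "طريق"}

def conditional_transliterate(sentence: str) -> str:
    """Rewrite Arabic-origin words in Arabic script, leave other words as-is."""
    out = []
    for word in sentence.split():
        key = word.lower()
        if ETYMOLOGY.get(key) == "arabic" and key in ARABIC_FORMS:
            out.append(ARABIC_FORMS[key])
        else:
            out.append(word)  # non-Arabic-origin words stay in Latin script
    return " ".join(out)

print(conditional_transliterate("kelb triq skola computer"))
# -> كلب طريق skola computer
```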
arXiv Detail & Related papers (2024-01-30T11:04:36Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- PALI: A Language Identification Benchmark for Perso-Arabic Scripts [30.99179028187252]
This paper sheds light on the challenges of detecting languages using Perso-Arabic scripts.
We use a set of supervised techniques to classify sentences into their languages.
We also propose a hierarchical model that targets clusters of languages that are more often confused.
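The summary does not name the supervised techniques used, so the baseline below is an assumption: a character n-gram classifier, a common choice for separating languages that share a Perso-Arabic script. The two example sentences and ISO 639-3 labels are toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: one Central Kurdish and one Persian sentence fragment.
sentences = ["بۆ نموونە", "برای مثال"]
labels = ["ckb", "fas"]

# Character n-grams capture script- and orthography-level cues that
# distinguish languages written with overlapping character inventories.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(sentences, labels)
print(clf.predict(["بۆ نموونە"]))
```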
arXiv Detail & Related papers (2023-04-03T19:40:14Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Since gold labels are not available for the translated text in the target language, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for that translated text.
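A minimal PyTorch rendering of the named loss, assuming the student predicts class logits and the soft pseudo-labels are probabilities produced by a teacher pass; FILTER's exact weighting, temperature, and teacher setup are not reproduced here.

```python
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          soft_pseudo_labels: torch.Tensor) -> torch.Tensor:
    """KL divergence between auto-generated soft pseudo-labels and the
    student's predicted distribution on translated target-language text."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_probs, soft_pseudo_labels, reduction="batchmean")

# Example shapes: batch of 8 sentences, 3 classes.
logits = torch.randn(8, 3)
pseudo = F.softmax(torch.randn(8, 3), dim=-1)  # teacher-generated soft labels
loss = self_teaching_kl_loss(logits, pseudo)
```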
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
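A rough sketch of singular vector canonical correlation analysis (SVCCA) between two views of language representations, using NumPy for SVD truncation and scikit-learn's CCA; the dimensions and the random "views" are placeholders, not the paper's data or exact procedure.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca(view_a, view_b, keep=20, n_components=10):
    """SVD-truncate each view, then run CCA and report the mean correlation
    of the canonical components."""
    def truncate(x, k):
        x = x - x.mean(axis=0)
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        return u[:, :k] * s[:k]

    a, b = truncate(view_a, keep), truncate(view_b, keep)
    cca = CCA(n_components=n_components, max_iter=2000)
    a_c, b_c = cca.fit_transform(a, b)
    corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

# Example: 100 languages described by typological features in one view and by
# learned NMT language embeddings in the other (dimensions are illustrative).
typology = np.random.randn(100, 50)
learned = np.random.randn(100, 64)
print(svcca(typology, learned))
```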
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
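The summary does not spell out how order insensitivity is achieved, so the snippet below only illustrates one simple option: randomly shuffling source-language tokens (keeping token-label alignment) when fine-tuning for a sequence-labeling task.

```python
import random

def shuffle_words(tokens, labels, seed=None):
    """Shuffle a tagged source-language sentence so the model cannot rely on
    its word order; token-label pairs stay aligned, only positions change."""
    rng = random.Random(seed)
    paired = list(zip(tokens, labels))
    rng.shuffle(paired)
    shuffled_tokens, shuffled_labels = zip(*paired)
    return list(shuffled_tokens), list(shuffled_labels)

tokens = ["The", "president", "visited", "Berlin"]
labels = ["O", "O", "O", "B-LOC"]
print(shuffle_words(tokens, labels, seed=0))
```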
arXiv Detail & Related papers (2020-01-30T03:35:44Z)