Cross-Lingual Text Classification of Transliterated Hindi and Malayalam
- URL: http://arxiv.org/abs/2108.13620v1
- Date: Tue, 31 Aug 2021 05:13:17 GMT
- Title: Cross-Lingual Text Classification of Transliterated Hindi and Malayalam
- Authors: Jitin Krishnan, Antonios Anastasopoulos, Hemant Purohit, Huzefa Rangwala
- Abstract summary: We combine data augmentation approaches with a Teacher-Student training scheme to address the inadequate handling of transliterated text by modern neural models.
We evaluate our method on transliterated Hindi and Malayalam, also introducing new datasets for benchmarking on real-world scenarios.
Our method yielded an average improvement of +5.6% on mBERT and +4.7% on XLM-R in F1 scores over their strong baselines.
- Score: 31.86825573676501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transliteration is very common on social media, but transliterated text is
not adequately handled by modern neural models for various NLP tasks. In this
work, we combine data augmentation approaches with a Teacher-Student training
scheme to address this issue in a cross-lingual transfer setting for
fine-tuning state-of-the-art pre-trained multilingual language models such as
mBERT and XLM-R. We evaluate our method on transliterated Hindi and Malayalam,
also introducing new datasets for benchmarking on real-world scenarios: one on
sentiment classification in transliterated Malayalam, and another on crisis
tweet classification in transliterated Hindi and Malayalam (related to the 2013
North India and 2018 Kerala floods). Our method yielded an average improvement
of +5.6% on mBERT and +4.7% on XLM-R in F1 scores over their strong baselines.
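As a concrete illustration, a Teacher-Student scheme of this general shape could look like the sketch below. This is a minimal sketch under stated assumptions, not the authors' released pipeline: the teacher checkpoint path is a placeholder, the ITRANS romanization scheme used for augmentation and the equal loss weighting are illustrative choices.

```python
# Minimal sketch of Teacher-Student fine-tuning for transliterated text.
# Assumptions (not from the paper): a teacher mBERT already fine-tuned on
# native-script labeled data, and augmentation by romanizing that same data.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from indic_transliteration import sanscript  # one possible romanizer

MODEL = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(MODEL)
teacher = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-teacher")  # hypothetical fine-tuned checkpoint
student = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=teacher.config.num_labels)
teacher.eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

def romanize(text: str) -> str:
    # Data augmentation: Devanagari -> Latin (ITRANS); Malayalam text would
    # use sanscript.MALAYALAM as the source scheme instead.
    return sanscript.transliterate(text, sanscript.DEVANAGARI, sanscript.ITRANS)

def train_step(native_texts, labels):
    """Hard CE loss on native-script text, plus a soft KL loss tying the
    student's predictions on transliterations to the teacher's predictions."""
    translit = tok([romanize(t) for t in native_texts],
                   padding=True, truncation=True, return_tensors="pt")
    native = tok(native_texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        soft_targets = F.softmax(teacher(**native).logits, dim=-1)
    kl = F.kl_div(F.log_softmax(student(**translit).logits, dim=-1),
                  soft_targets, reduction="batchmean")
    ce = F.cross_entropy(student(**native).logits, torch.tensor(labels))
    loss = ce + kl  # equal weighting is an arbitrary choice here
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

In this shape the student sees both scripts while the teacher anchors the label distribution; the paper's actual augmentation strategies and loss weighting may differ.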
Related papers
- Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages [0.0]
This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages.
In this approach, we aim to build an ASR model for languages with limited digital resources by sequentially adapting the model across linguistically similar languages.
We experiment with this approach on Malasar, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India.
arXiv Detail & Related papers (2024-11-07T09:57:57Z)
- Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification [1.174020933567308]
Hawrami, a dialect of Kurdish, is classified as an endangered language.
This paper introduces various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers.
arXiv Detail & Related papers (2024-09-25T12:52:21Z)
- Multilingual Text Style Transfer: Datasets & Models for Indian Languages [1.116636487692753]
This paper focuses on sentiment transfer, a popular text style transfer (TST) subtask, across a spectrum of Indian languages.
We introduce dedicated datasets of 1,000 positive and 1,000 negative style-parallel sentences for each of the eight languages considered.
We evaluate the performance of various benchmark models categorized into parallel, non-parallel, cross-lingual, and shared learning approaches.
arXiv Detail & Related papers (2024-05-31T14:05:27Z)
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
We propose TransMI, a framework that can create a strong baseline well-suited for data that is transliterated into a common script (a minimal data-side illustration appears after this list).
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
arXiv Detail & Related papers (2024-05-16T09:08:09Z)
- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer [92.80671770992572]
Cross-lingual transfer is a central task in multilingual NLP.
Earlier efforts on this task use parallel corpora, bilingual dictionaries, or other annotated alignment data.
We propose a simple yet effective method, SALT, to improve zero-shot cross-lingual transfer.
arXiv Detail & Related papers (2023-09-19T19:30:56Z)
- cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models [0.9012198585960441]
This paper describes our multiclass classification system developed as part of the LT-EDI@RANLP-2023 shared task.
We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions.
We developed the best-performing seven-label classification system for Malayalam, based on weighted macro-averaged F1 score.
arXiv Detail & Related papers (2023-08-20T21:30:34Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican [4.4096464238164295]
We show that transfer effectiveness is correlated with the amount of training data and the relationship between languages.
We contribute a rule-based French-Haitian orthographic and syntactic engine and a novel method for phonological embedding.
In very low-resource Jamaican MT, code-switching with a transfer language for orthographic resemblance yields a 6.63 BLEU point advantage.
arXiv Detail & Related papers (2022-09-13T20:58:46Z)
- Overcoming Catastrophic Forgetting in Zero-Shot Cross-Lingual Generation [48.80125962015044]
We investigate the problem of performing a generative task (i.e., summarization) in a target language when labeled data is only available in English.
We find that parameter-efficient adaptation provides gains over standard fine-tuning when transferring between less-related languages.
Our methods can provide further quality gains, suggesting that robust zero-shot cross-lingual generation is within reach.
arXiv Detail & Related papers (2022-05-25T10:41:34Z)
- Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
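For the TransMI entry above, the data-side idea of mapping a mixed-script corpus into one common script can be illustrated as follows. This is a hedged sketch only: it uses the `unidecode` package as a stand-in romanizer, and it omits the vocabulary-side changes a framework like TransMI makes to the pretrained model itself.

```python
# Illustration only: romanize a mixed-script corpus into a single Latin-script
# representation before fine-tuning. `unidecode` gives rough, lossy
# transliterations; a production system would use a purpose-built romanizer.
from unidecode import unidecode

corpus = [
    "मुझे यह पसंद है",          # Hindi (Devanagari script)
    "എനിക്ക് ഇത് ഇഷ്ടമാണ്",   # Malayalam script
    "I like this",              # already Latin script
]
romanized = [unidecode(s) for s in corpus]
print(romanized)  # all three sentences now share one (approximate) Latin script
```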