Leveraging Language Identification to Enhance Code-Mixed Text
Classification
- URL: http://arxiv.org/abs/2306.04964v1
- Date: Thu, 8 Jun 2023 06:43:10 GMT
- Title: Leveraging Language Identification to Enhance Code-Mixed Text
Classification
- Authors: Gauri Takawane, Abhishek Phaltankar, Varad Patwardhan, Aryan Patil,
Raviraj Joshi, Mukta S. Takalikar
- Abstract summary: Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
- Score: 0.7340017786387767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of more than one language in the same text is referred to as
Code Mixed. There is a clear and growing adoption of code-mixed data, especially
English mixed with a regional language, on social media platforms. Existing
deep-learning models do not take advantage of the implicit language information
in code-mixed text. Our study aims to improve the performance of BERT-based
models on low-resource code-mixed Hindi-English datasets
by experimenting with language augmentation approaches. We propose a pipeline
for improving code-mixed systems, comprising data preprocessing, word-level
language identification, language augmentation, and model training on
downstream tasks such as sentiment analysis. For language augmentation in BERT
models, we explore word-level interleaving and post-sentence placement of
language information. We have examined the performance of vanilla BERT-based
models and their code-mixed HingBERT counterparts on respective benchmark
datasets, comparing their results with and without using word-level language
information. The models were evaluated using metrics such as accuracy,
precision, recall, and F1 score. Our findings show that the proposed language
augmentation approaches work well across different BERT models. We demonstrate
the importance of augmenting code-mixed text with language information on five
different code-mixed Hindi-English downstream datasets based on sentiment
analysis, hate speech detection, and emotion detection.
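To make the two augmentation formats concrete, here is a minimal Python sketch of how word-level language tags produced by an earlier language-identification step might be attached to a code-mixed sentence before it is tokenized for a BERT or HingBERT classifier. The tag labels, example sentence, and helper-function names are illustrative assumptions, not the authors' exact implementation.

def interleave_language_tags(tokens, lang_tags):
    # Word-level interleaving: each word is immediately followed by its language tag.
    augmented = []
    for token, tag in zip(tokens, lang_tags):
        augmented.extend([token, tag])
    return " ".join(augmented)

def append_language_tags(tokens, lang_tags):
    # Post-sentence placement: the full tag sequence is appended after the sentence.
    return " ".join(tokens) + " " + " ".join(lang_tags)

tokens = ["movie", "bahut", "acchi", "thi"]   # code-mixed Hindi-English example (assumed)
lang_tags = ["EN", "HI", "HI", "HI"]          # per-word language labels (assumed format)

print(interleave_language_tags(tokens, lang_tags))  # movie EN bahut HI acchi HI thi HI
print(append_language_tags(tokens, lang_tags))      # movie bahut acchi thi EN HI HI HI

Either augmented string would then be tokenized and passed to the downstream classifier like ordinary input text.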
Related papers
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z) - Comparative Study of Pre-Trained BERT Models for Code-Mixed
Hindi-English Data [0.7874708385247353]
"Code Mixed" refers to the use of more than one language in the same text.
In this work, we focus on low-resource Hindi-English code-mixed language.
We report state-of-the-art results on respective datasets using HingBERT-based models.
arXiv Detail & Related papers (2023-05-25T05:10:28Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic
Patterns [0.5560631344057825]
We propose several Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks.
Our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy (a hypothetical sketch of this masking step appears after the related-papers list).
We test our data augmentation method in a variety of low-resource and cross-lingual settings, reaching up to a relative improvement of 7.73% on the extremely scarce English-Malayalam dataset.
arXiv Detail & Related papers (2022-11-14T18:50:16Z) - Language Identification of Hindi-English tweets using code-mixed BERT [0.0]
The work utilizes a collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification.
The results show that representations pre-trained over code-mixed data produce better results than their monolingual counterparts.
arXiv Detail & Related papers (2021-07-02T17:51:36Z) - Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z) - Sentiment Analysis of Persian-English Code-mixed Texts [0.0]
Due to the unstructured nature of social media data, we are observing more instances of multilingual and code-mixed texts.
In this study, we collect and label a dataset of Persian-English code-mixed tweets.
We introduce a model that uses pretrained BERT embeddings as well as translation models to automatically learn the polarity scores of these tweets.
arXiv Detail & Related papers (2021-02-25T06:05:59Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model with character language models trained on varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language may fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
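As a rough illustration of the constant-mask idea from the "Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns" entry above, the following hypothetical Python sketch replaces a random fraction of tokens in a matrix-language sentence with a fixed mask token. The mask token, replacement rate, and function name are assumptions for illustration only, and for simplicity every token is treated as a matrix-language candidate rather than being filtered by a language-identification step.

import random

def scm_mask_augment(sentence, mask_token="[MASK]", replace_prob=0.15, seed=None):
    # Replace a random fraction of tokens with a constant mask token
    # to create an additional synthetic training example.
    rng = random.Random(seed)
    return " ".join(
        mask_token if rng.random() < replace_prob else tok
        for tok in sentence.split()
    )

print(scm_mask_augment("the movie was really good yaar", replace_prob=0.3, seed=0))
# the movie was [MASK] good yaar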