Comparative Study of Pre-Trained BERT Models for Code-Mixed
Hindi-English Data
- URL: http://arxiv.org/abs/2305.15722v2
- Date: Fri, 26 May 2023 06:51:22 GMT
- Title: Comparative Study of Pre-Trained BERT Models for Code-Mixed
Hindi-English Data
- Authors: Aryan Patil, Varad Patwardhan, Abhishek Phaltankar, Gauri Takawane,
Raviraj Joshi
- Abstract summary: "Code Mixed" refers to the use of more than one language in the same text.
In this work, we focus on the low-resource Hindi-English code-mixed language.
We report state-of-the-art results on respective datasets using HingBERT-based models.
- Score: 0.7874708385247353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The term "Code Mixed" refers to the use of more than one language in the same
text. This phenomenon is predominantly observed on social media platforms, with
an increasing amount of adaptation as time goes on. It is critical to detect
foreign elements in a language and process them correctly, as a considerable
number of individuals are using code-mixed languages that could not be
comprehended by understanding one of those languages. In this work, we focus on
low-resource Hindi-English code-mixed language and enhancing the performance of
different code-mixed natural language processing tasks such as sentiment
analysis, emotion recognition, and hate speech identification. We perform a
comparative analysis of different Transformer-based language Models pre-trained
using unsupervised approaches. We have included the code-mixed models like
HingBERT, HingRoBERTa, HingRoBERTa-Mixed, mBERT, and non-code-mixed models like
AlBERT, BERT, and RoBERTa for comparative analysis of code-mixed Hindi-English
downstream tasks. We report state-of-the-art results on respective datasets
using HingBERT-based models which are specifically pre-trained on real
code-mixed text. Our HingBERT-based models provide significant improvements
thus highlighting the poor performance of vanilla BERT models on code-mixed
text.
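As a concrete illustration of the comparison setup described above, the following is a minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries. The checkpoint name l3cube-pune/hing-bert, the CSV file names, the label count, and the hyperparameters are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch: fine-tune a pre-trained encoder (HingBERT or a vanilla BERT
# variant) on a code-mixed Hindi-English classification task such as sentiment
# analysis. Checkpoint name, CSV columns, label count, and hyperparameters are
# illustrative assumptions, not the paper's exact setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "l3cube-pune/hing-bert"  # swap in "bert-base-cased", "roberta-base", etc.

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Hypothetical CSV files with a "text" column (code-mixed post) and a "label" column.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128, padding="max_length")

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="hingbert-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=data["train"],
    eval_dataset=data["validation"],
)
trainer.train()
print(trainer.evaluate())
```

Repeating this loop with each checkpoint (HingBERT, HingRoBERTa, HingRoBERTa-Mixed, mBERT, ALBERT, BERT, RoBERTa) and comparing validation metrics mirrors the comparative analysis described in the abstract.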
Related papers
- From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements for the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
Cline is the largest dataset of its kind, with 16,642 sentences drawn from two sources.
Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs).
arXiv Detail & Related papers (2024-05-09T06:40:39Z) - Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z) - Leveraging Language Identification to Enhance Code-Mixed Text
Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives (a simplified sketch of this objective follows this entry).
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - Beyond Contrastive Learning: A Variational Generative Model for
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and
BERT Language Models [1.14219428942199]
We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script.
We show the effectiveness of BERT models pre-trained on this corpus on subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark.
arXiv Detail & Related papers (2022-04-18T16:49:59Z) - Language Identification of Hindi-English tweets using code-mixed BERT [0.0]
The work utilizes a collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification.
The results show that representations pre-trained on code-mixed data produce better results than their monolingual counterparts.
arXiv Detail & Related papers (2021-07-02T17:51:36Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
Since gold labels are not available for the translated training text, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for the translated text in the target language (a minimal sketch of this loss follows after this entry).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.