Language Identification of Hindi-English tweets using code-mixed BERT
- URL: http://arxiv.org/abs/2107.01202v1
- Date: Fri, 2 Jul 2021 17:51:36 GMT
- Title: Language Identification of Hindi-English tweets using code-mixed BERT
- Authors: Mohd Zeeshan Ansari, M M Sufyan Beg, Tanvir Ahmad, Mohd Jazib Khan,
Ghazali Wasim
- Abstract summary: The work utilizes a data collection of Hindi-English-Urdu codemixed text for language pre-training and Hindi-English codemixed text for subsequent word-level language classification.
The results show that the representations pre-trained over codemixed data produce better results than their monolingual counterpart.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language identification of social media text has been an interesting problem
of study in recent years. Social media messages in non-English-speaking states
are predominantly code-mixed. Prior knowledge from pre-trained contextual
embeddings has shown state-of-the-art results for a range of downstream tasks.
Recently, models such as BERT have shown that, given a large amount of unlabeled
data, pretrained language models are even more beneficial for learning
common language representations. This paper presents extensive experiments that
exploit transfer learning and fine-tune BERT models to identify language on
Twitter. The work utilizes a data collection of
Hindi-English-Urdu codemixed text for language pre-training and Hindi-English
codemixed text for subsequent word-level language classification. The results show
that the representations pre-trained over codemixed data produce better results
than their monolingual counterpart.
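As a rough illustration of what word-level language classification involves when a subword model such as BERT is fine-tuned, the sketch below aligns one language tag per word with the subword pieces a tokenizer would produce. This is not taken from the paper: the splitter `toy_subword_split`, the tag set `{"hi", "en"}`, and the example sentence are all hypothetical, and a real setup would use the model's own WordPiece tokenizer. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, used here so that only the first piece of each word would contribute to training.

```python
from typing import Dict, List, Tuple

IGNORE = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def toy_subword_split(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)] or [word]
    # Continuation pieces get the "##" prefix, mirroring WordPiece notation.
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


def align_labels(
    words: List[str], tags: List[str], tag2id: Dict[str, int]
) -> Tuple[List[str], List[int]]:
    """Give each word's first piece its language id; mask the remaining pieces."""
    pieces: List[str] = []
    label_ids: List[int] = []
    for word, tag in zip(words, tags):
        sub = toy_subword_split(word)
        pieces.extend(sub)
        label_ids.extend([tag2id[tag]] + [IGNORE] * (len(sub) - 1))
    return pieces, label_ids


# Illustrative code-mixed sentence with assumed word-level tags.
tag2id = {"hi": 0, "en": 1}
words = ["mujhe", "yeh", "movie", "pasand", "hai"]
tags = ["hi", "hi", "en", "hi", "hi"]

pieces, label_ids = align_labels(words, tags, tag2id)
for piece, label in zip(pieces, label_ids):
    print(f"{piece}\t{label}")
```

Labeling only the first piece of each word is one common convention; labeling every piece with the word's tag is an equally plausible alternative, and the abstract does not say which, if either, the authors used.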
Related papers
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.
We incorporate additional contextual cues together with the signing video.
We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z)
- On Importance of Code-Mixed Embeddings for Hate Speech Identification [0.4194295877935868]
We analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models in hate speech detection.
Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3-HingCorpus, outperform BERT models when tested on hate speech text datasets.
arXiv Detail & Related papers (2024-11-27T18:23:57Z)
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource Code-Mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data [0.7874708385247353]
"Code Mixed" refers to the use of more than one language in the same text.
In this work, we focus on low-resource Hindi-English code-mixed language.
We report state-of-the-art results on respective datasets using HingBERT-based models.
arXiv Detail & Related papers (2023-05-25T05:10:28Z)
- L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models [1.14219428942199]
We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script.
We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark.
arXiv Detail & Related papers (2022-04-18T16:49:59Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improve the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)