Development of POS tagger for English-Bengali Code-Mixed data
- URL: http://arxiv.org/abs/2007.14576v1
- Date: Wed, 29 Jul 2020 03:42:07 GMT
- Title: Development of POS tagger for English-Bengali Code-Mixed data
- Authors: Tathagata Raha, Sainik Kumar Mahata, Dipankar Das, Sivaji
Bandyopadhyay
- Abstract summary: We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words were written in Roman script.
Our system was checked using 100 manually POS tagged code-mixed sentences and it returned an accuracy of 75.29%.
- Score: 14.298803822659934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code-mixed texts are widespread nowadays due to the advent of social media.
Since these texts combine two languages to formulate a sentence, it gives rise
to various research problems related to Natural Language Processing. In this
paper, we try to excavate one such problem, namely, Parts of Speech tagging of
code-mixed texts. We have built a system that can POS tag English-Bengali
code-mixed data where the Bengali words were written in Roman script. Our
approach initially involves the collection and cleaning of English-Bengali
code-mixed tweets. These tweets were used as a development dataset for building
our system. The proposed system is a modular approach that starts by tagging
individual tokens with their respective languages and then passes them to
different POS taggers, designed for different languages (English and Bengali,
in our case). Tags given by the two systems are later joined together and the
final result is then mapped to a universal POS tag set. Our system was checked
using 100 manually POS tagged code-mixed sentences and it returned an accuracy
of 75.29%
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - Marathi-English Code-mixed Text Generation [0.0]
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
arXiv Detail & Related papers (2023-09-28T06:51:26Z) - Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA)
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - Transformer-based Model for Word Level Language Identification in
Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer based model for word-level language identification in code-mixed Kannada English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z) - BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish
Text Using Transformers [1.181206257787103]
This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system.
For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences.
arXiv Detail & Related papers (2022-06-17T10:36:50Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z) - Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z) - HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish
Text [1.6675267471157407]
We present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages)
HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
arXiv Detail & Related papers (2021-07-08T11:11:37Z) - Role of Artificial Intelligence in Detection of Hateful Speech for
Hinglish Data on Social Media [1.8899300124593648]
Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world.
Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages.
We propose a methodology for efficient detection of unstructured code-mix Hinglish language.
arXiv Detail & Related papers (2021-05-11T10:02:28Z) - NITS-Hinglish-SentiMix at SemEval-2020 Task 9: Sentiment Analysis For
Code-Mixed Social Media Text Using an Ensemble Model [1.1265248232450553]
This work proposes a system named NITS-Hinglish-SentiMix to viably complete the sentiment analysis of code-mixed Hinglish text.
The proposed framework has recorded an F-Score of 0.617 on the test data.
arXiv Detail & Related papers (2020-07-23T15:45:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.