HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
- URL: http://arxiv.org/abs/2107.03760v1
- Date: Thu, 8 Jul 2021 11:11:37 GMT
- Title: HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text
- Authors: Vivek Srivastava, Mayank Singh
- Abstract summary: We present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages).
HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
- Score: 1.6675267471157407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text generation is a highly active area of research in the computational
linguistic community. The evaluation of the generated text is a challenging
task and multiple theories and metrics have been proposed over the years.
Unfortunately, text generation and evaluation are relatively understudied due
to the scarcity of high-quality resources in code-mixed languages where the
words and phrases from multiple languages are mixed in a single utterance of
text and speech. To address this challenge, we present a corpus (HinGE) for a
widely popular code-mixed language Hinglish (code-mixing of Hindi and English
languages). HinGE has Hinglish sentences generated by humans as well as two
rule-based algorithms corresponding to the parallel Hindi-English sentences. In
addition, we demonstrate the inefficacy of widely-used evaluation metrics on
the code-mixed data. The HinGE dataset will facilitate the progress of natural
language generation research in code-mixed languages.
Related papers
- Marathi-English Code-mixed Text Generation [0.0]
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
arXiv Detail & Related papers (2023-09-28T06:51:26Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [16.14337612590717]
We propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles.
As a use case, we leverage multilingual articles and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset.
The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs.
arXiv Detail & Related papers (2023-02-23T04:04:18Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers [1.181206257787103]
This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system.
For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences.
arXiv Detail & Related papers (2022-06-17T10:36:50Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- A Survey on Retrieval-Augmented Text Generation [53.04991859796971]
Retrieval-augmented text generation has remarkable advantages and has achieved state-of-the-art performance in many NLP tasks.
The survey first highlights the generic paradigm of retrieval-augmented generation, then reviews notable approaches according to different tasks.
arXiv Detail & Related papers (2022-02-02T16:18:41Z)
- MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation [1.2559148369195197]
Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text.
Various widely popular metrics perform poorly on code-mixed NLG tasks.
We present a metric independent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments.
arXiv Detail & Related papers (2021-07-24T05:24:26Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embeddings for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state-of-the-art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)