MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
- URL: http://arxiv.org/abs/2107.11534v1
- Date: Sat, 24 Jul 2021 05:24:26 GMT
- Title: MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation
- Authors: Ayush Garg, Sammed S Kagi, Vivek Srivastava, Mayank Singh
- Abstract summary: Code-mixing is the phenomenon of mixing words and phrases from two or more languages in a single utterance of speech or text.
Widely used evaluation metrics perform poorly on code-mixed NLG tasks.
We present MIPE, a metric-independent evaluation pipeline that significantly improves the correlation between evaluation metrics and human judgments.
- Score: 1.2559148369195197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-mixing is the phenomenon of mixing words and phrases from two or more
languages in a single utterance of speech or text. Owing to this high linguistic
diversity, code-mixing poses several challenges for evaluating standard
natural language generation (NLG) tasks. Widely used evaluation metrics perform
poorly on code-mixed NLG tasks. To address this challenge, we present a
metric-independent evaluation pipeline, MIPE, that significantly improves the
correlation between evaluation metrics and human judgments on generated
code-mixed text. As a use case, we demonstrate the performance of MIPE on
machine-generated Hinglish (code-mixing of Hindi and English) sentences
from the HinGE corpus. The proposed evaluation strategy extends to other
code-mixed language pairs, NLG tasks, and evaluation metrics with minimal
to no effort.
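The abstract does not spell out the pipeline's individual stages, but the core idea, cleaning noisy code-mixed text before handing it to any off-the-shelf metric, can be illustrated. Below is a minimal, hypothetical Python sketch: the normalize step (lowercasing plus collapsing repeated characters, a common source of Hinglish spelling variation) is an assumed stand-in, not MIPE's actual procedure.

```python
# Hypothetical sketch of a metric-independent wrapper in the spirit of
# MIPE: normalize code-mixed inputs, then delegate to ANY existing
# metric. The `normalize` stand-in below is an assumption, not the
# paper's actual procedure.
import re
from typing import Callable, List

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def normalize(tokens: List[str]) -> List[str]:
    # Lowercase and collapse runs of 3+ repeated characters
    # ("haiii" -> "haii"), a frequent Hinglish spelling variation.
    return [re.sub(r"(.)\1{2,}", r"\1\1", tok.lower()) for tok in tokens]

def metric_independent_score(
    metric: Callable[[List[List[str]], List[str]], float],
    references: List[List[str]],
    hypothesis: List[str],
) -> float:
    # The wrapper never looks inside `metric`; swapping BLEU for chrF
    # (or any scorer with this signature) requires no other change.
    return metric([normalize(ref) for ref in references],
                  normalize(hypothesis))

# Example with NLTK's BLEU as the wrapped metric.
smooth = SmoothingFunction().method1
bleu = lambda refs, hyp: sentence_bleu(refs, hyp, smoothing_function=smooth)
print(metric_independent_score(
    bleu,
    [["mujhe", "yeh", "movie", "pasand", "hai"]],
    ["Mujhe", "yeh", "movie", "pasand", "haiii"]))
```

Because only the inputs are touched, the same call works unchanged with any scorer that takes tokenized references and a hypothesis.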
Related papers
- Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences [3.359458926468223]
We introduce GAME: A Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences.
GAME does not require gold-standard code-mixed sentences for evaluation, thus eliminating the need for human annotators.
We release a dataset containing gold-standard code-mixed sentences across 4 language pairs.
arXiv Detail & Related papers (2024-10-14T14:54:05Z)
- From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements of the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
The accompanying dataset, Cline, is the largest of its kind with 16,642 sentences drawn from two sources.
Experiments on Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs).
arXiv Detail & Related papers (2024-05-09T06:40:39Z)
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
The dataset is designed to test whether metrics can identify 68 categories of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks (a toy version of the pairwise challenge-set check is sketched below).
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
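A contrastive challenge set reduces metric evaluation to a pairwise check: a metric passes an item when it scores the good translation above the incorrect one, and accuracy over all items summarizes the metric. A hedged sketch with a toy unigram-overlap metric standing in for a real one (all names here are illustrative, not from the ACES codebase):

```python
# Pairwise challenge-set evaluation: a metric passes an item when it
# ranks the good translation above the incorrect one.
from typing import Callable, List, Tuple

def contrastive_accuracy(metric: Callable[[str, str], float],
                         items: List[Tuple[str, str, str]]) -> float:
    # items: (reference, good_translation, incorrect_translation)
    wins = sum(1 for ref, good, bad in items
               if metric(ref, good) > metric(ref, bad))
    return wins / len(items)

def unigram_overlap(ref: str, hyp: str) -> float:
    # Toy stand-in metric: fraction of hypothesis words in the reference.
    ref_words, hyp_words = set(ref.split()), set(hyp.split())
    return len(ref_words & hyp_words) / max(len(hyp_words), 1)

items = [("the cat sat on the mat",
          "the cat sat on the mat",     # good translation
          "the dog sat on the mat")]    # accuracy error: wrong entity
print(contrastive_accuracy(unigram_overlap, items))  # 1.0
```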
- Marathi-English Code-mixed Text Generation [0.0]
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with the Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics; a sketch of the CMI formula follows below.
arXiv Detail & Related papers (2023-09-28T06:51:26Z)
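The Code Mixing Index referenced above has a standard published form (Das and Gambäck, 2014): CMI = 100 × (1 − max(w_i) / (n − u)) when n > u, and 0 otherwise, where max(w_i) is the token count of the dominant language, n the total number of tokens, and u the number of language-independent tokens. A small sketch of that formula (DCM is not reproduced here); the tag labels are illustrative.

```python
# Code Mixing Index (CMI) of Das & Gambäck (2014), computed from
# per-token language tags; "univ" marks language-independent tokens
# such as named entities, numerals, and punctuation.
from collections import Counter
from typing import List

def cmi(tags: List[str], neutral: str = "univ") -> float:
    n = len(tags)
    u = sum(1 for tag in tags if tag == neutral)
    if n == u:  # only language-independent tokens: no mixing possible
        return 0.0
    dominant = max(Counter(tag for tag in tags if tag != neutral).values())
    return 100.0 * (1.0 - dominant / (n - u))

# Hypothetical Marathi-English sentence tagged mr/mr/en/mr:
print(cmi(["mr", "mr", "en", "mr"]))  # 100 * (1 - 3/4) = 25.0
```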
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (see the sketch below).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
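Exercising the multiple-reference idea requires no new machinery: NLTK's sentence_bleu already accepts several references per hypothesis and clips n-gram counts against the best-matching one. A minimal illustration with made-up sentences:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "the cat is sitting on the mat".split()
references = [
    "the cat sat on the mat".split(),     # original reference
    "a cat is sitting on a mat".split(),  # added reference diversity
]
smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
one_ref = sentence_bleu(references[:1], hypothesis, smoothing_function=smooth)
two_refs = sentence_bleu(references, hypothesis, smoothing_function=smooth)
# With a second acceptable phrasing available, more hypothesis n-grams
# find a match, so the score can only rise here.
print(f"1 reference: {one_ref:.3f}   2 references: {two_refs:.3f}")
```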
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (the correlation recipe is sketched below).
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
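That 0.514 figure comes from the standard meta-evaluation recipe used throughout this list: rank-correlating automatic scores with human judgments. A minimal sketch with placeholder numbers (not data from any of these papers):

```python
from scipy.stats import spearmanr

human_ratings = [4.5, 3.0, 2.0, 5.0, 1.5]       # hypothetical human scores
metric_scores = [0.71, 0.42, 0.39, 0.80, 0.25]  # hypothetical metric scores
rho, p_value = spearmanr(human_ratings, metric_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")  # rho = 1.0 here
```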
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis [79.18261352971284]
We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation.
We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings.
SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.
arXiv Detail & Related papers (2022-10-10T22:30:26Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [1.6675267471157407]
We present HinGE, a corpus for the widely popular code-mixed language Hinglish (code-mixing of Hindi and English).
HinGE contains Hinglish sentences generated by humans as well as by two rule-based algorithms from the parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
arXiv Detail & Related papers (2021-07-08T11:11:37Z)
- Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text [1.6675267471157407]
Code-mixing is a frequent communication style among multilingual speakers where they mix words and phrases from two different languages in the same utterance of text or speech.
This paper demonstrates several inherent limitations of code-mixing metrics, with examples drawn from existing datasets that are widely used across experiments.
arXiv Detail & Related papers (2021-06-18T13:26:48Z)