Challenges and Limitations with the Metrics Measuring the Complexity of
Code-Mixed Text
- URL: http://arxiv.org/abs/2106.10123v1
- Date: Fri, 18 Jun 2021 13:26:48 GMT
- Title: Challenges and Limitations with the Metrics Measuring the Complexity of
Code-Mixed Text
- Authors: Vivek Srivastava, Mayank Singh
- Score: 1.6675267471157407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-mixing is a frequent communication style among multilingual speakers
where they mix words and phrases from two different languages in the same
utterance of text or speech. Identifying and filtering code-mixed text is a
challenging task due to its co-existence with monolingual and noisy text. Over
the years, several code-mixing metrics have been extensively used to identify
and validate code-mixed text quality. This paper demonstrates several inherent
limitations of code-mixing metrics with examples from the already existing
datasets that are popularly used across various experiments.
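One widely used metric of the kind the paper critiques is the Code-Mixing Index (CMI) of Das and Gambäck (2014), which scores an utterance by how far it departs from being monolingual. A minimal Python sketch, assuming token-level language tags where the tag `"univ"` marks language-independent tokens (the tag name and the example sentence are illustrative choices, not from the paper):

```python
def code_mixing_index(tokens):
    """Compute CMI for a tagged utterance.

    tokens: list of (word, lang) pairs; lang == "univ" marks
    language-independent tokens (punctuation, named entities, etc.).
    Returns 0 for monolingual or fully language-independent utterances,
    approaching 100 as the language mix becomes more even.
    """
    n = len(tokens)
    counts = {}
    u = 0  # language-independent token count
    for _, lang in tokens:
        if lang == "univ":
            u += 1
        else:
            counts[lang] = counts.get(lang, 0) + 1
    if n == u:  # no language-tagged tokens at all
        return 0.0
    # CMI = 100 * (1 - max_lang_count / (n - u))
    return 100.0 * (1 - max(counts.values()) / (n - u))

# Hinglish example: 4 Hindi tokens, 1 English token
mixed = [("mujhe", "hi"), ("yeh", "hi"), ("movie", "en"),
         ("pasand", "hi"), ("hai", "hi")]
print(code_mixing_index(mixed))  # → 20.0
```

A purely monolingual utterance scores 0, which is one reason the paper can probe such metrics: a low-but-nonzero CMI cannot distinguish light, natural code-mixing from noisy or mislabeled monolingual text.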
Related papers
- RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval [0.0]
In India, social media users frequently engage in code-mixed conversations using the Roman script.
This paper focuses on the challenges of extracting relevant information from code-mixed conversations.
We develop a mechanism to automatically identify the most relevant answers from code-mixed conversations.
arXiv Detail & Related papers (2024-11-07T14:41:01Z)
- Language Agnostic Code Embeddings [61.84835551549612]
We focus on the cross-lingual capabilities of code embeddings across different programming languages.
Code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details.
We show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks.
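The summary does not spell out how the language-specific component is isolated; as an illustrative assumption (not necessarily the paper's method), one simple way to strip a per-language component is to center each language's embeddings on that language's mean vector:

```python
def remove_language_component(embeddings_by_lang):
    """Center each language's embeddings on its own mean vector.

    embeddings_by_lang: dict mapping language name -> list of embedding
    vectors (lists of floats, all the same dimension).
    Subtracting the per-language mean removes a shared, language-specific
    offset, leaving the component that varies within the language.
    """
    out = {}
    for lang, vecs in embeddings_by_lang.items():
        dim = len(vecs[0])
        mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        out[lang] = [[v[i] - mean[i] for i in range(dim)] for v in vecs]
    return out

# Toy 2-D embeddings for one hypothetical language
centered = remove_language_component({"python": [[1.0, 2.0], [3.0, 4.0]]})
print(centered["python"])  # → [[-1.0, -1.0], [1.0, 1.0]]
```

After centering, nearest-neighbor retrieval across languages compares the within-language variation rather than the language offsets, which is the intuition behind the retrieval gains the summary reports.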
arXiv Detail & Related papers (2023-10-25T17:34:52Z)
- MacLaSa: Multi-Aspect Controllable Text Generation via Efficient Sampling from Compact Latent Space [110.85888003111653]
Multi-aspect controllable text generation aims to generate fluent sentences that possess multiple desired attributes simultaneously.
We introduce a novel approach for multi-aspect control, namely MacLaSa, that estimates compact latent space for multiple aspects.
We show that MacLaSa outperforms several strong baselines on attribute relevance and textual quality while maintaining a high inference speed.
arXiv Detail & Related papers (2023-05-22T07:30:35Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of Transformer-Transducer (T-T), a streaming model commonly used in industry.
We first propose a strategy to generate code-switching text data, and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [16.14337612590717]
We propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles.
As a use case, we leverage multilingual articles and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset.
The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs.
arXiv Detail & Related papers (2023-02-23T04:04:18Z)
- PreCogIIITH at HinglishEval: Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality [18.806186479627335]
In our submission to HinglishEval, a shared task collocated with INLG 2022, we attempt to build models that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality.
arXiv Detail & Related papers (2022-06-16T08:00:42Z)
- MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation [1.2559148369195197]
Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text.
Various widely popular metrics perform poorly with the code-mixed NLG tasks.
We present a metric independent evaluation pipeline MIPE that significantly improves the correlation between evaluation metrics and human judgments.
arXiv Detail & Related papers (2021-07-24T05:24:26Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- IIT Gandhinagar at SemEval-2020 Task 9: Code-Mixed Sentiment Classification Using Candidate Sentence Generation and Selection [1.2301855531996841]
Code-mixing adds to the challenge of analyzing the sentiment of the text due to the non-standard writing style.
We present a candidate sentence generation and selection based approach on top of the Bi-LSTM based neural classifier.
The proposed approach shows an improvement in the system performance as compared to the Bi-LSTM based neural classifier.
arXiv Detail & Related papers (2020-06-25T14:59:47Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.