PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language
Model Embeddings To Estimate Code-Mix Quality
- URL: http://arxiv.org/abs/2206.07988v1
- Date: Thu, 16 Jun 2022 08:00:42 GMT
- Title: PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language
Model Embeddings To Estimate Code-Mix Quality
- Authors: Prashant Kodali, Tanmay Sachan, Akshay Goindani, Anmol Goel, Naman
Ahuja, Manish Shrivastava, Ponnurangam Kumaraguru
- Abstract summary: In our submission to HinglishEval, a shared-task collocated with INLG2022, we attempt to model the factors that impact the quality of synthetically generated code-mixed text by predicting ratings for code-mix quality.
- Score: 18.806186479627335
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Code-Mixing is a phenomenon of mixing two or more languages in a speech event
and is prevalent in multilingual societies. Given the low-resource nature of
Code-Mixing, machine generation of code-mixed text is a prevalent approach for
data augmentation. However, evaluating the quality of such machine generated
code-mixed text is an open problem. In our submission to HinglishEval, a
shared-task collocated with INLG2022, we attempt to build models that capture
the factors impacting the quality of synthetically generated code-mixed text by
predicting ratings for code-mix quality.
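The "code-mixing metrics" the title refers to include aggregate measures such as the Code-Mixing Index (CMI) of Gambäck & Das (2016). As a minimal illustrative sketch (the function name and tag scheme below are assumptions for demonstration, not taken from the paper), CMI can be computed from per-token language tags:

```python
from collections import Counter

def cmi(token_langs):
    """Code-Mixing Index (Gambäck & Das, 2016) for a single utterance.

    token_langs: per-token language tags, e.g. ["hi", "en", "hi", "other"],
    where "other" marks language-independent tokens (names, punctuation).
    Returns 0 for monolingual utterances; higher values mean more even mixing.
    """
    n = len(token_langs)
    lang_counts = Counter(t for t in token_langs if t != "other")
    u = n - sum(lang_counts.values())   # language-independent tokens
    if n == u:                          # nothing is language-tagged
        return 0.0
    max_wi = max(lang_counts.values())  # dominant-language token count
    return 100.0 * (n - u - max_wi) / (n - u)

# Evenly mixed Hindi-English utterance:
print(cmi(["hi", "en", "hi", "en"]))  # 50.0
# Monolingual English utterance:
print(cmi(["en", "en", "en"]))        # 0.0
```

Metrics of this kind, concatenated with language-model embeddings, give a compact feature vector for a quality-rating regressor.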
Related papers
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements for the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
Cline is the largest dataset of its kind, with 16,642 sentences drawn from two sources.
Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs).
arXiv Detail & Related papers (2024-05-09T06:40:39Z)
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
- Persona-aware Generative Model for Code-mixed Language [34.826316146894364]
We make a pioneering attempt to develop a persona-aware generative model to generate texts resembling real-life code-mixed texts of individuals.
We propose a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data.
PARADOX achieves 1.6 points better CM BLEU, 47% better perplexity and 32% better semantic coherence than the non-persona-based counterparts.
arXiv Detail & Related papers (2023-09-06T11:20:41Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model on the CoLI-Kenglish dataset achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61.
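The gap between the weighted F1 (0.84) and macro F1 (0.61) reported above is typical of imbalanced label sets: frequent languages dominate the support-weighted average, while poorly predicted rare classes pull the unweighted macro average down. A small self-contained sketch with toy data (not the CoLI-Kenglish dataset) illustrates the effect:

```python
from collections import Counter

def f1(y_true, y_pred, label):
    """Per-class F1 computed from exact counts (no external dependencies)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_and_weighted_f1(y_true, y_pred):
    labels = sorted(set(y_true))
    support = Counter(y_true)
    scores = {lbl: f1(y_true, y_pred, lbl) for lbl in labels}
    macro = sum(scores.values()) / len(labels)  # unweighted mean over classes
    weighted = sum(scores[lbl] * support[lbl] for lbl in labels) / len(y_true)
    return macro, weighted

# Toy word-level LID run: the frequent class ("kn") is predicted well,
# the rare class ("en") poorly, so weighted F1 exceeds macro F1.
y_true = ["kn"] * 8 + ["en"] * 2
y_pred = ["kn"] * 9 + ["en"]
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(round(macro, 2), round(weighted, 2))  # 0.8 0.89
```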
arXiv Detail & Related papers (2022-11-26T02:39:19Z)
- BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers [1.181206257787103]
This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system.
For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences.
arXiv Detail & Related papers (2022-06-17T10:36:50Z)
- Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning [72.9458277424712]
Mixture Model Auto-Encoders (MixMate) is a novel architecture that clusters data by performing inference on a generative model.
We show that MixMate achieves competitive performance compared to state-of-the-art deep clustering algorithms.
arXiv Detail & Related papers (2021-10-10T02:30:31Z)
- Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text [1.6675267471157407]
We synthetically generate code-mixed Hinglish sentences using two distinct approaches.
We employ human annotators to rate the generation quality.
arXiv Detail & Related papers (2021-08-04T06:02:46Z)
- Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text [1.6675267471157407]
Code-mixing is a frequent communication style among multilingual speakers where they mix words and phrases from two different languages in the same utterance of text or speech.
This paper demonstrates several inherent limitations of code-mixing metrics with examples from the already existing datasets that are popularly used across various experiments.
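One such limitation can be shown directly: aggregate metrics like the Code-Mixing Index depend only on per-language token counts, so utterances with very different switching behaviour receive identical scores. A minimal illustrative sketch (the `cmi` helper and tag lists below are written here for demonstration, not taken from the paper):

```python
def cmi(tags):
    """Code-Mixing Index over language tags: 100 * (1 - dominant fraction)."""
    counts = {}
    for t in tags:
        counts[t] = counts.get(t, 0) + 1
    return 100.0 * (len(tags) - max(counts.values())) / len(tags)

# Identical language-tag counts, very different switching patterns:
alternating = ["hi", "en", "hi", "en", "hi", "en"]  # a switch at every token
blocked     = ["hi", "hi", "hi", "en", "en", "en"]  # a single switch point
assert cmi(alternating) == cmi(blocked) == 50.0
```

Switch-point-aware measures (e.g. the I-index, which counts switch points) are one way to separate cases that CMI conflates.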
arXiv Detail & Related papers (2021-06-18T13:26:48Z)
- CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing [44.54537067761167]
We present CodemixedNLP, an open-source library with the goals of bringing together the advances in code-mixed NLP and opening it up to a wider machine learning community.
The library consists of tools to develop and benchmark versatile model architectures that are tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish.
arXiv Detail & Related papers (2021-06-10T18:49:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.