BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers
 - URL: http://arxiv.org/abs/2206.08680v1
 - Date: Fri, 17 Jun 2022 10:36:50 GMT
 - Title: BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers
 - Authors: Shaz Furniturewala, Vijay Kumari, Amulya Ratna Dash, Hriday Kedia, Yashvardhan Sharma
 - Abstract summary: This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system.
For the HinglishEval task, the proposed model uses multi-lingual BERT to find the similarity between synthetically generated and human-generated sentences.
 - Score: 1.181206257787103
 - License: http://creativecommons.org/licenses/by-sa/4.0/
 - Abstract:   Code-Mixed text data consists of sentences having words or phrases from more
than one language. Most multi-lingual communities worldwide communicate using
multiple languages, with English usually among them. Hinglish is a Code-Mixed
text composed of Hindi and English but written in Roman script. This paper aims
to determine the factors influencing the quality of Code-Mixed text data
generated by the system. For the HinglishEval task, the proposed model uses
multi-lingual BERT to find the similarity between synthetically generated and
human-generated sentences to predict the quality of synthetically generated
Hinglish sentences.
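The core idea of the abstract, scoring a synthetic sentence by its similarity to a human-generated reference, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy bag-of-words `toy_embed()` merely stands in for multilingual BERT sentence embeddings, and `quality_score()` for their learned quality predictor.

```python
import math

def toy_embed(sentence, dim=32):
    """Stand-in for a multilingual-BERT sentence embedding (assumption:
    any sentence encoder could slot in here). Hashes each token into a
    fixed-size bag-of-words vector."""
    vec = [0.0] * dim
    for tok in sentence.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def quality_score(synthetic, human_reference):
    """Similarity to the human-generated sentence, used here as a
    proxy for the quality of the synthetic sentence."""
    return cosine(toy_embed(synthetic), toy_embed(human_reference))
```

With real mBERT, `toy_embed` would instead pool the encoder's final hidden states, and the resulting similarity could feed a regression head that predicts the human quality rating.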
        Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
Transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv  Detail & Related papers  (2024-06-28T08:59:24Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv  Detail & Related papers  (2024-06-16T16:10:51Z)
- Marathi-English Code-mixed Text Generation [0.0]
Code-mixing, the blending of linguistic elements from distinct languages to form meaningful sentences, is common in multilingual settings.
This research introduces a Marathi-English code-mixed text generation algorithm, assessed with Code Mixing Index (CMI) and Degree of Code Mixing (DCM) metrics.
arXiv  Detail & Related papers  (2023-09-28T06:51:26Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv  Detail & Related papers  (2023-03-23T18:16:30Z)
- Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts [55.41644538483948]
We propose the use of a Transformer-based model for word-level language identification in code-mixed Kannada-English texts.
The proposed model achieves a weighted F1-score of 0.84 and a macro F1-score of 0.61 on the CoLI-Kenglish dataset.
arXiv  Detail & Related papers  (2022-11-26T02:39:19Z)
- PreCogIIITH at HinglishEval: Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality [18.806186479627335]
In our submission to HinglishEval, a shared task collocated with INLG 2022, we attempt to build models that estimate the quality of synthetically generated code-mixed text by predicting ratings for code-mix quality.
arXiv  Detail & Related papers  (2022-06-16T08:00:42Z)
- HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [1.6675267471157407]
We present HinGE, a corpus for the widely popular code-mixed language Hinglish (code-mixing of Hindi and English).
HinGE contains Hinglish sentences generated by humans as well as by two rule-based algorithms, corresponding to parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely used evaluation metrics on code-mixed data.
arXiv  Detail & Related papers  (2021-07-08T11:11:37Z)
- NUIG-Shubhanker@Dravidian-CodeMix-FIRE2020: Sentiment Analysis of Code-Mixed Dravidian text using XLNet [0.0]
Social media has penetrated multilingual societies, yet most users prefer English for communication.
It is natural for them to mix their native language with English during conversations, resulting in an abundance of multilingual data, called code-mixed data, in today's world.
Downstream NLP tasks on such data are challenging because its semantics are spread across multiple languages.
This paper uses an auto-regressive XLNet model to perform sentiment analysis on code-mixed Tamil-English and Malayalam-English datasets.
arXiv  Detail & Related papers  (2020-10-15T14:09:02Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv  Detail & Related papers  (2020-09-10T22:42:15Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv  Detail & Related papers  (2020-05-06T04:46:11Z)
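Several papers above measure mixing with the Code Mixing Index (CMI) named in the Marathi-English entry. A minimal sketch of the standard token-level formulation, assuming per-token language tags with "univ" marking language-independent tokens such as punctuation (the tag names are illustrative):

```python
def cmi(tags):
    """Code Mixing Index for one utterance's per-token language tags:
    CMI = 100 * (1 - max_lang_count / (n - u)), where n is the total
    token count and u the count of language-independent ("univ") tokens."""
    n = len(tags)
    lang_counts = {}
    u = 0
    for t in tags:
        if t == "univ":
            u += 1
        else:
            lang_counts[t] = lang_counts.get(t, 0) + 1
    if not lang_counts:
        return 0.0  # empty or fully language-independent: no mixing
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))
```

A monolingual utterance scores 0, while an even two-language split scores 50, the maximum for two languages.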
This list is automatically generated from the titles and abstracts of the papers on this site.

This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.