Persona-aware Generative Model for Code-mixed Language
- URL: http://arxiv.org/abs/2309.02915v2
- Date: Sat, 19 Oct 2024 07:06:30 GMT
- Title: Persona-aware Generative Model for Code-mixed Language
- Authors: Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty,
- Abstract summary: We make a pioneering attempt to develop a persona-aware generative model to generate texts resembling real-life code-mixed texts of individuals.
We propose a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data.
PARADOX achieves 1.6 points better CM BLEU, 47% better perplexity and 32% better semantic coherence than the non-persona-based counterparts.
- Score: 34.826316146894364
- License:
- Abstract: Code-mixing and script-mixing are prevalent across online social networks and multilingual societies. However, a user's preference toward code-mixing depends on the socioeconomic status, demographics of the user, and the local context, which existing generative models mostly ignore while generating code-mixed texts. In this work, we make a pioneering attempt to develop a persona-aware generative model to generate texts resembling real-life code-mixed texts of individuals. We propose a Persona-aware Generative Model for Code-mixed Generation, PARADOX, a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data. We propose an alignment module that re-calibrates the generated sequence to resemble real-life code-mixed texts. PARADOX generates code-mixed texts that are semantically more meaningful and linguistically more valid. To evaluate the personification capabilities of PARADOX, we propose four new metrics -- CM BLEU, CM Rouge-1, CM Rouge-L and CM KS. On average, PARADOX achieves 1.6 points better CM BLEU, 47% better perplexity and 32% better semantic coherence than the non-persona-based counterparts.
Related papers
- Multilingual Controlled Generation And Gold-Standard-Agnostic Evaluation of Code-Mixed Sentences [3.359458926468223]
We introduce GAME: A Gold-Standard Agnostic Measure for Evaluation of Code-Mixed Sentences.
Game does not require gold-standard code-mixed sentences for evaluation, thus eliminating the need for human annotators.
We release a dataset containing gold-standard code-mixed sentences across 4 language pairs.
arXiv Detail & Related papers (2024-10-14T14:54:05Z) - Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements for the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources.
Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs)
arXiv Detail & Related papers (2024-05-09T06:40:39Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to empower multilingual understanding models the generation abilities to get a unified model.
We propose a textbfSemantic-textbfGuided textbfAlignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA)
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language
Model Embeddings To Estimate Code-Mix Quality [18.806186479627335]
We attempt to build models that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality.
In our submission to HinglishEval, a shared-task collocated with INLG2022, we attempt to build models that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality.
arXiv Detail & Related papers (2022-06-16T08:00:42Z) - L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and
BERT Language Models [1.14219428942199]
We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script.
We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark.
arXiv Detail & Related papers (2022-04-18T16:49:59Z) - Exploring Text-to-Text Transformers for English to Hinglish Machine
Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English)
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.