Persona-aware Generative Model for Code-mixed Language
- URL: http://arxiv.org/abs/2309.02915v1
- Date: Wed, 6 Sep 2023 11:20:41 GMT
- Title: Persona-aware Generative Model for Code-mixed Language
- Authors: Ayan Sengupta, Md Shad Akhtar, Tanmoy Chakraborty
- Abstract summary: We make a pioneering attempt to develop a persona-aware generative model to generate texts resembling real-life code-mixed texts of individuals.
We propose a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data.
PARADOX achieves 1.6 points better CM BLEU, 47% better perplexity and 32% better semantic coherence than the non-persona-based counterparts.
- Score: 39.14128923434994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-mixing and script-mixing are prevalent across online social networks and
multilingual societies. However, a user's preference toward code-mixing depends
on the socioeconomic status, demographics of the user, and the local context,
which existing generative models mostly ignore while generating code-mixed
texts. In this work, we make a pioneering attempt to develop a persona-aware
generative model to generate texts resembling real-life code-mixed texts of
individuals. We propose a Persona-aware Generative Model for Code-mixed
Generation, PARADOX, a novel Transformer-based encoder-decoder model that
encodes an utterance conditioned on a user's persona and generates code-mixed
texts without monolingual reference data. We propose an alignment module that
re-calibrates the generated sequence to resemble real-life code-mixed texts.
PARADOX generates code-mixed texts that are semantically more meaningful and
linguistically more valid. To evaluate the personification capabilities of
PARADOX, we propose four new metrics -- CM BLEU, CM Rouge-1, CM Rouge-L and CM
KS. On average, PARADOX achieves 1.6 points better CM BLEU, 47% better
perplexity and 32% better semantic coherence than the non-persona-based
counterparts.
Related papers
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z) - From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences [18.53327811304381]
Modelling human judgements for the acceptability of code-mixed text can help in distinguishing natural code-mixed text.
Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources.
Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned Multilingual Large Language Models (MLLMs)
arXiv Detail & Related papers (2024-05-09T06:40:39Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [59.32609948217718]
We present CodeIP, a new watermarking technique for Large Language Models (LLMs)-based code generation.
CodeIP enables the insertion of multi-bit information while preserving the semantics of the generated code.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Extrapolating Multilingual Understanding Models as Multilingual
Generators [82.1355802012414]
This paper explores methods to empower multilingual understanding models the generation abilities to get a unified model.
We propose a textbfSemantic-textbfGuided textbfAlignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z) - Prompting Multilingual Large Language Models to Generate Code-Mixed
Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA)
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language
Model Embeddings To Estimate Code-Mix Quality [18.806186479627335]
We attempt to build models that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality.
In our submission to HinglishEval, a shared-task collocated with INLG2022, we attempt to build models that impact the quality of synthetically generated code-mix text by predicting ratings for code-mix quality.
arXiv Detail & Related papers (2022-06-16T08:00:42Z) - L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and
BERT Language Models [1.14219428942199]
We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script.
We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark.
arXiv Detail & Related papers (2022-04-18T16:49:59Z) - CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing [44.54537067761167]
We present Codemixed, an open-source library with the goals of bringing together the advances in code-mixed NLP and opening it up to a wider machine learning community.
The library consists of tools to develop and benchmark versatile model architectures that are tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish.
arXiv Detail & Related papers (2021-06-10T18:49:29Z) - Exploring Text-to-Text Transformers for English to Hinglish Machine
Translation with Synthetic Code-Mixing [19.19256927651015]
We describe models that convert monolingual English text into Hinglish (code-mixed Hindi and English)
Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models.
Our models place first in the overall ranking of the English-Hinglish official shared task.
arXiv Detail & Related papers (2021-05-18T19:50:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.