Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data
- URL: http://arxiv.org/abs/2502.12924v1
- Date: Tue, 18 Feb 2025 15:04:13 GMT
- Title: Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data
- Authors: Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
- Abstract summary: Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP).
This paper presents a novel methodology to generate CS data using Large Language Models (LLMs).
We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS.
- Score: 21.240439045909724
- License:
- Abstract: Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP). Current Large Language Models (LLMs) struggle to interpret and generate code-switched text, primarily due to the scarcity of large-scale CS datasets for training. This paper presents a novel methodology to generate CS data using LLMs, and test it on the English-Spanish language pair. We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS. Unlike previous approaches to CS generation, our methodology uses natural CS data as a starting point, allowing models to learn its natural distribution beyond grammatical patterns. We thoroughly analyse the models' performance through a study on human preferences, a qualitative error analysis and an evaluation with popular automatic metrics. Results show that our methodology generates fluent code-switched text, expanding research opportunities in CS communication, and that traditional metrics do not correlate with human judgement when assessing the quality of the generated CS data. We release our code and generated dataset under a CC-BY-NC-SA license.
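To make the data pipeline described in the abstract concrete, the sketch below shows one way the back-translated parallel corpus could be turned into an instruction-style fine-tuning set. This is a minimal sketch, not the authors' released code: the example sentence pairs, the prompt template, and the names build_sft_examples and cs_sft_train.jsonl are illustrative assumptions.

```python
# Minimal sketch (assumed names and data): build prompt/completion records
# for supervised fine-tuning from pairs of natural code-switched (CS)
# sentences and their back-translations into monolingual English.
import json

# Hypothetical parallel data: (back-translated English, original natural CS).
parallel_pairs = [
    ("I went to the party with my friends yesterday.",
     "Fui a la fiesta con mis friends yesterday."),
    ("She told me that the meeting is on Friday.",
     "Me dijo que el meeting es on Friday."),
]

PROMPT_TEMPLATE = (
    "Rewrite the following English sentence as natural "
    "English-Spanish code-switched text.\n"
    "English: {source}\n"
    "Code-switched:"
)

def build_sft_examples(pairs):
    """Turn (English, CS) pairs into prompt/completion records for SFT."""
    for english, code_switched in pairs:
        yield {
            "prompt": PROMPT_TEMPLATE.format(source=english),
            "completion": " " + code_switched,
        }

with open("cs_sft_train.jsonl", "w", encoding="utf-8") as f:
    for record in build_sft_examples(parallel_pairs):
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Records built this way can be fed to any standard supervised fine-tuning setup; the key property of the methodology is that the target side is naturally occurring CS, so the model learns its natural distribution rather than only grammatically templated switches.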
Related papers
- Northeastern Uni at Multilingual Counterspeech Generation: Enhancing Counter Speech Generation with LLM Alignment through Direct Preference Optimization [1.1368382184602488]
The automatic generation of counter-speech (CS) is a critical strategy for addressing hate speech by providing constructive and informed responses.
Existing methods often fail to generate high-quality, impactful, and scalable CS, particularly across diverse linguistic contexts.
We propose a novel methodology to enhance CS generation by aligning Large Language Models (LLMs) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
arXiv Detail & Related papers (2024-12-19T23:22:11Z) - Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models [16.82812708514889]
Code-switching, the phenomenon of alternating between two or more languages in a single conversation, presents unique challenges for Natural Language Processing (NLP).
Most existing research focuses on either syntactic constraints or neural generation, with few efforts to integrate linguistic theory with large language models (LLMs) for generating natural code-switched text.
We introduce EZSwitch, a novel framework that combines Equivalence Constraint Theory (ECT) with LLMs to produce linguistically valid and fluent code-switched text.
arXiv Detail & Related papers (2024-10-30T03:03:32Z) - LLM-based Code-Switched Text Generation for Grammatical Error Correction [3.4457319208816224]
This work explores the complexities of applying Grammatical Error Correction systems to code-switching (CSW) texts.
We evaluate state-of-the-art GEC systems on an authentic CSW dataset from English as a Second Language learners.
We develop a model capable of correcting grammatical errors in monolingual and CSW texts.
arXiv Detail & Related papers (2024-10-14T10:07:29Z) - Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text [1.9185059111021852]
We investigate how pre-trained Language Models handle code-switched text in three dimensions.
Our findings reveal that pre-trained language models are effective in generalising to code-switched text.
arXiv Detail & Related papers (2024-03-07T19:46:03Z) - Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z) - Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness for neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z) - A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z) - Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities.
We present a novel vantage point that treats CS as style variation between the two participating languages.
We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z) - Rnn-transducer with language bias for end-to-end Mandarin-English code-switching speech recognition [58.105818353866354]
We propose an improved recurrent neural network transducer (RNN-T) model with language bias to alleviate the code-switching problem.
We use the language identities to bias the model to predict the CS points.
This encourages the model to learn language identity information directly from the transcription, so no additional LID model is needed.
arXiv Detail & Related papers (2020-02-19T12:01:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.