Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation
- URL: http://arxiv.org/abs/2506.03535v1
- Date: Wed, 04 Jun 2025 03:31:00 GMT
- Title: Across Programming Language Silos: A Study on Cross-Lingual Retrieval-augmented Code Generation
- Authors: Qiming Zhu, Jialun Cao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung
- Abstract summary: Multi-lingual RACG systems are valuable for migrating code-bases across programming languages. We construct a dataset spanning 13 PLs with nearly 14k instances to explore the utility and robustness of multi-lingual RACG systems.
- Score: 48.07804537257056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current research on large language models (LLMs) with retrieval-augmented code generation (RACG) mainly focuses on single-language settings, leaving cross-lingual effectiveness and security unexplored. Multi-lingual RACG systems are valuable for migrating code-bases across programming languages (PLs), yet face risks from error propagation (e.g., adversarial data corruption) in cross-lingual transfer. We construct a dataset spanning 13 PLs with nearly 14k instances to explore the utility and robustness of multi-lingual RACG systems. Our investigation reveals four key insights: (1) Effectiveness: multi-lingual RACG significantly enhances code generation by multi-lingual code LLMs; (2) Inequality: Java demonstrates superior cross-lingual utility over Python in RACG; (3) Robustness: adversarial attacks degrade performance significantly in mono-lingual RACG but have a mitigated impact in cross-lingual scenarios; counterintuitively, perturbed code may even improve RACG in cross-lingual settings; (4) Specialization: domain-specific code retrievers significantly outperform general text retrievers. These findings establish a foundation for developing effective and secure multi-lingual code assistants.
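For readers unfamiliar with the setup, the following is a minimal sketch of a cross-lingual RACG loop of the kind studied here: retrieve reference snippets (possibly written in a different PL than the target) and condition the generator on them. The `embed` and `llm` callables and all function names below are illustrative placeholders, not the paper's actual pipeline.

```python
# Illustrative sketch only: cross-lingual retrieval-augmented code generation.
# `embed` and `llm` are user-supplied placeholder callables (hypothetical), not
# the retrievers or models evaluated in the paper.
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snippet:
    language: str  # e.g. "java", "python"
    code: str


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(task: str, corpus: List[Snippet],
             embed: Callable[[str], List[float]], top_k: int = 3) -> List[Snippet]:
    """Rank corpus snippets by embedding similarity to the task description."""
    q = embed(task)
    return sorted(corpus, key=lambda s: cosine(q, embed(s.code)), reverse=True)[:top_k]


def cross_lingual_racg(task: str, target_pl: str, corpus: List[Snippet],
                       embed: Callable[[str], List[float]],
                       llm: Callable[[str], str]) -> str:
    """Build a prompt from retrieved (possibly cross-PL) references, then generate target-PL code."""
    references = retrieve(task, corpus, embed)
    prompt = "\n\n".join(f"# Reference ({s.language}):\n{s.code}" for s in references)
    prompt += f"\n\n# Task: {task}\n# Write the solution in {target_pl}."
    return llm(prompt)
```

In this framing, a cross-lingual setting simply means the retrieved `Snippet.language` values differ from `target_pl`; the paper's robustness experiments correspond to corrupting the retrieved references before prompting.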
Related papers
- The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora [6.594531626178451]
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. We study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. We propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages.
arXiv Detail & Related papers (2025-07-10T08:38:31Z)
- SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset [34.40254709148148]
Code-Switching (CS) is the alternating use of two or more languages within a conversation or utterance. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems. SwitchLingua is the first large-scale multilingual and multi-ethnic code-switching dataset.
arXiv Detail & Related papers (2025-05-30T05:54:46Z)
- Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z)
- MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety [56.79292318645454]
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. We introduce a multilingual guardrail with reasoning for prompt classification.
arXiv Detail & Related papers (2025-04-21T17:15:06Z)
- Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task [73.35882908048423]
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary NLP. This paper investigates the effectiveness of RAG across multiple languages by proposing novel approaches for multilingual open-domain question-answering.
arXiv Detail & Related papers (2025-04-04T17:35:43Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding [10.154013836043816]
Code-switching in red-teaming queries can effectively elicit undesirable behaviors of large language models (LLMs).
We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries.
We demonstrate that the CSRT significantly outperforms existing multilingual red-teaming techniques.
arXiv Detail & Related papers (2024-06-17T06:08:18Z)
- Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection [19.399281609371258]
Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results.
We resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection.
arXiv Detail & Related papers (2023-11-03T16:51:07Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)