Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study
- URL: http://arxiv.org/abs/2310.16937v3
- Date: Tue, 10 Jun 2025 14:10:33 GMT
- Title: Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study
- Authors: Razan Baltaji, Saurabh Pujar, Louis Mandel, Martin Hirzel, Luca Buratti, Lav Varshney
- Abstract summary: Large language models (LLMs) have achieved state-of-the-art performance in various software engineering tasks. Critical languages such as COBOL, as well as emerging languages such as Rust and Swift, remain low-resource due to limited openly available code. We develop a performance prediction model that identifies the best source languages for a given target language and task.
- Score: 5.350495525141013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have achieved state-of-the-art performance in various software engineering tasks, including error detection, clone detection, and code translation, primarily leveraging high-resource programming languages like Python and Java. However, many critical languages, such as COBOL, as well as emerging languages, such as Rust and Swift, remain low-resource due to limited openly available code. This scarcity hampers the training and effectiveness of LLMs for these languages, increasing software maintenance costs and stifling innovation. Addressing this gap, we investigate the potential of transfer learning to enhance LLM performance on low-resource programming languages by leveraging data from high-resource counterparts. Our extensive empirical study evaluates transferability across 10 to 41 programming languages and five key tasks: code generation, clone detection, code repair, solution domain classification, and error detection. Additionally, we develop a performance prediction model to guess the best source languages for a given target and task, and analyze the features that influence transfer performance. We further replicate a representative subset of experiments with a larger model to test the generalizability of our conclusions to contemporary large-scale LLMs. Our findings demonstrate that cross-lingual transfer significantly outperforms zero-shot learning, with effectiveness varying based on both source and target languages. Furthermore, our model reliably predicts successful transfer sources by considering linguistic and dataset-specific features, offering practical guidance for data acquisition and model training. This work contributes to the development of LLM-driven tools for low-resource programming languages and provides insights into the characteristics that facilitate transfer across language pairs.
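The source-selection step described in the abstract can be illustrated with a small sketch: a feature-based regressor trained on previously measured transfer results that ranks candidate source languages for a new target and task. Everything below is an illustrative assumption rather than the paper's actual pipeline; the feature set (dataset size, corpus overlap, a syntactic-similarity score), the gradient-boosted regressor, and the toy numbers are hypothetical stand-ins for the linguistic and dataset-specific features the abstract mentions.

```python
# Illustrative sketch (not the paper's exact pipeline): rank candidate source
# languages for a low-resource target by predicting transfer performance from
# simple language- and dataset-level features.
from dataclasses import dataclass

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


@dataclass
class TransferExample:
    source: str          # high-resource source language used for fine-tuning
    target: str          # low-resource target language
    task: str            # e.g. "clone_detection", "code_repair"
    source_size: float   # log10 number of source-language training samples
    overlap: float       # token/subword overlap between source and target corpora
    similarity: float    # hypothetical syntactic-similarity score in [0, 1]
    score: float         # observed downstream metric after transfer (label)


def featurize(ex: TransferExample) -> np.ndarray:
    # Dataset-specific and linguistic features; this exact feature set is an assumption.
    return np.array([ex.source_size, ex.overlap, ex.similarity])


# Toy rows standing in for measured transfer results.
history = [
    TransferExample("Python", "Rust",  "clone_detection", 6.2, 0.41, 0.55, 0.78),
    TransferExample("Java",   "Rust",  "clone_detection", 6.5, 0.37, 0.62, 0.81),
    TransferExample("Go",     "Rust",  "clone_detection", 5.8, 0.44, 0.70, 0.84),
    TransferExample("Python", "COBOL", "error_detection", 6.2, 0.12, 0.18, 0.52),
    TransferExample("Java",   "COBOL", "error_detection", 6.5, 0.15, 0.25, 0.61),
]

X = np.stack([featurize(ex) for ex in history])
y = np.array([ex.score for ex in history])

predictor = GradientBoostingRegressor(random_state=0).fit(X, y)

# Rank candidate sources for a new target/task by predicted transfer performance.
candidates = [
    TransferExample("Python", "Swift", "code_repair", 6.2, 0.35, 0.50, float("nan")),
    TransferExample("Java",   "Swift", "code_repair", 6.5, 0.35, 0.58, float("nan")),
    TransferExample("Go",     "Swift", "code_repair", 5.8, 0.30, 0.47, float("nan")),
]
ranked = sorted(
    candidates,
    key=lambda ex: predictor.predict(featurize(ex).reshape(1, -1))[0],
    reverse=True,
)
print("best predicted source for Swift:", ranked[0].source)
```

In practice the training rows would presumably come from measured transfer experiments such as those the study reports, and the ranked output would guide which high-resource corpus to fine-tune on before adapting to the low-resource target.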
Related papers
- Enhancing Multilingual Language Models for Code-Switched Input Data [0.0]
This research investigates whether pre-training Multilingual BERT (mBERT) on code-switched datasets improves the model's performance on critical NLP tasks. We use a dataset of Spanglish tweets for pre-training and evaluate the pre-trained model against a baseline model. Our findings show that our pre-trained mBERT model outperforms or matches the baseline model on the given tasks, with the most significant improvements seen for part-of-speech tagging.
arXiv Detail & Related papers (2025-03-11T02:49:41Z) - Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - The Rise and Down of Babel Tower: Investigating the Evolution Process of Multilingual Code Large Language Model [59.357993924917]
We study the evolution of multilingual capabilities in large language models (LLMs) during the pre-training process. We propose the Babel Tower Hypothesis, which describes the entire process of LLMs acquiring new language capabilities. We propose a novel method to construct an optimized pre-training corpus for multilingual code LLMs.
arXiv Detail & Related papers (2024-12-10T08:28:57Z) - Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism.
LLMs initially understand the query, converting multilingual inputs into English for task-solving.
In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z) - Enhancing Multilingual Capabilities of Large Language Models through
Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Analysing Cross-Lingual Transfer in Low-Resourced African Named Entity
Recognition [0.10641561702689348]
We investigate the properties of cross-lingual transfer learning between ten low-resourced languages.
We find that models that perform well on a single language often do so at the expense of generalising to others.
The amount of data overlap between the source and target datasets is a better predictor of transfer performance than either the geographical or genetic distance between the languages.
arXiv Detail & Related papers (2023-09-11T08:56:47Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z) - Adapters for Enhanced Modeling of Multilingual Knowledge and Text [54.02078328453149]
Language models have been extended to multilingual language models (MLLMs).
Knowledge graphs contain facts in an explicit triple format, which require careful curation and are only available in a few high-resource languages.
We propose to enhance MLLMs with knowledge from multilingual knowledge graphs (MLKGs) so as to tackle language and knowledge graph tasks across many languages.
arXiv Detail & Related papers (2022-10-24T21:33:42Z) - Language Chameleon: Transformation analysis between languages using
Cross-lingual Post-training based on Pre-trained language models [4.731313022026271]
In this study, we focus on a single low-resource language and perform extensive evaluation and probing experiments using cross-lingual post-training (XPT).
Results show that XPT not only outperforms or performs on par with monolingual models trained with orders of magnitudes more data but also is highly efficient in the transfer process.
arXiv Detail & Related papers (2022-09-14T05:20:52Z) - MetaTPTrans: A Meta Learning Approach for Multilingual Code
Representation Learning [5.434698132994918]
We propose MetaTPTrans, a meta learning approach for multilingual code representation learning.
We show that MetaTPTrans improves the F1 score of state-of-the-art approaches significantly.
arXiv Detail & Related papers (2022-06-13T20:36:42Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z) - Zero-Shot Cross-Lingual Transfer with Meta Learning [45.29398184889296]
We consider the setting of training models on multiple languages at the same time, when little or no data is available for languages other than English.
We show that this challenging setup can be approached using meta-learning.
We experiment using standard supervised, zero-shot cross-lingual, as well as few-shot cross-lingual settings for different natural language understanding tasks.
arXiv Detail & Related papers (2020-03-05T16:07:32Z)