Evaluating Programming Language Confusion
- URL: http://arxiv.org/abs/2503.13620v1
- Date: Mon, 17 Mar 2025 18:14:15 GMT
- Title: Evaluating Programming Language Confusion
- Authors: Micheline Bénédicte Moumoula, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande
- Abstract summary: Large Language Models for code (Code LLMs) have gained significant traction in software engineering. These models have demonstrated remarkable capabilities in understanding programming concepts, implementing algorithms, and even bridging different programming languages. Despite these advances, Code LLMs often struggle with programming language confusion--producing code in unintended languages.
- Score: 6.462594894731934
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models for code (Code LLMs) have gained significant traction in software engineering, achieving state-of-the-art performance on various programming tasks including code completion, generation, repair, and translation. These models have demonstrated remarkable capabilities in understanding programming concepts, implementing algorithms, and even bridging different programming languages, fundamentally transforming how developers interact with coding environments. Despite these advances, Code LLMs often struggle with programming language confusion--producing code in unintended languages despite explicit instructions or obvious context. We systematically evaluate this phenomenon across diverse programming contexts. Our study assesses seven popular general and Code LLMs across multiple natural and programming languages, analyzing their behavior using four datasets (HumanEval, HumanEval-xl, MBPP, TP3) for code generation and one dataset (CodeNet) for code translation. The study results reveal that language confusion occurs across all evaluated models, with StarCoder and CodeLlama exhibiting the highest confusion rates. Even high-performing models fail to maintain language consistency throughout generated solutions, particularly when handling complex algorithmic problems. We identify key factors contributing to this confusion, including syntactic similarities between programming languages and inconsistent prompt formatting. Interestingly, we find evidence suggesting that LLMs consistently exhibit strategic language migration behaviors, prioritizing languages where they can produce more syntactically correct code even when explicitly instructed otherwise. This phenomenon is particularly pronounced in code generation tasks, where models show strong migration patterns toward Python and between syntactically similar language pairs.
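One way to quantify the language-confusion rate described in the abstract is to run a lightweight language detector over each generated solution and count how often the detected language differs from the requested one. The sketch below is a minimal, hypothetical illustration of that idea, using Pygments' heuristic lexer guessing as a stand-in detector; the paper's actual measurement procedure is not specified here, and the function names and toy samples are invented for illustration.

```python
# Hypothetical sketch (not the paper's actual pipeline): estimate a
# "language confusion rate" by checking whether each generated snippet is
# recognised as the requested programming language. Pygments' heuristic
# lexer guessing stands in for a proper language identifier.
from pygments.lexers import guess_lexer
from pygments.util import ClassNotFound


def detected_language(snippet: str) -> str:
    """Best-effort language guess for a code snippet ('unknown' if none)."""
    try:
        return guess_lexer(snippet).name.lower()
    except ClassNotFound:
        return "unknown"


def confusion_rate(samples):
    """Fraction of (requested_language, generated_code) pairs whose detected
    language does not match the requested one."""
    mismatches = sum(
        1
        for requested, code in samples
        if detected_language(code) != requested.lower()
    )
    return mismatches / len(samples) if samples else 0.0


# Toy usage: the second sample requests Java but is answered in Python,
# so it should be counted as a case of language confusion.
samples = [
    ("python", "import math\n\ndef area(r):\n    return math.pi * r ** 2\n"),
    ("java", "import math\n\ndef area(r):\n    return math.pi * r ** 2\n"),
]
print(f"confusion rate: {confusion_rate(samples):.2f}")
```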
Related papers
- Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval [7.33924106492889]
Existing code generation benchmarks are designed to study Large Language Models' end-to-end performance. We construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. Our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort.
arXiv Detail & Related papers (2025-02-26T14:08:17Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - Multi-Programming Language Ensemble for Code Generation in Large Language Model [5.882816711878273]
Large language models (LLMs) have significantly improved code generation, particularly in one-pass code generation.
Most existing approaches focus solely on generating code in a single programming language, overlooking the potential of leveraging the multi-language capabilities of LLMs.
We propose Multi-Programming Language Ensemble (MPLE), a novel ensemble-based method that utilizes code generation across multiple programming languages to enhance overall performance.
arXiv Detail & Related papers (2024-09-06T08:31:18Z) - Benchmarking LLM Code Generation for Audio Programming with Visual Dataflow Languages [1.559169421643164]
Node-based programming languages are increasingly popular in media arts coding domains.
Using LLM-based code generation to further lower the barrier to creative output is an exciting opportunity.
The best strategy for code generation in visual node-based programming languages is still an open question.
arXiv Detail & Related papers (2024-09-01T22:11:23Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [28.295926947968574]
Large language models (LLMs) have brought a paradigm shift to the field of code generation.
We empirically analyze the differences in coding style between the code generated by Code LLMs and the code written by human developers.
arXiv Detail & Related papers (2024-06-29T14:56:11Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Various experiments and ablations on four datasets, covering both C++ and Python, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - Language Agnostic Code Embeddings [61.84835551549612]
We focus on the cross-lingual capabilities of code embeddings across different programming languages.
Code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details.
We show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-10-25T17:34:52Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval [32.60391966381949]
We introduce xCodeEval, the largest executable multilingual multitask benchmark to date.
It features a total of 7 tasks involving code understanding, generation, translation and retrieval.
xCodeEval adopts an execution-based evaluation and offers a multilingual code execution engine, ExecEval.
arXiv Detail & Related papers (2023-03-06T10:08:51Z)
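Following up on the Language Agnostic Code Embeddings entry above, the sketch below shows one simple way a language-specific component could be stripped from code embeddings: subtract each language's centroid so that only a more language-agnostic residual remains. This is an assumption-laden illustration, not the cited paper's actual isolation procedure; the function name and toy data are invented.

```python
# Hypothetical sketch (not the cited paper's method): remove a per-language
# component from code embeddings by subtracting each language's centroid,
# leaving a more language-agnostic representation.
import numpy as np


def remove_language_component(embeddings: np.ndarray, languages: list) -> np.ndarray:
    """Return embeddings (n_snippets x dim) with each language's mean vector removed."""
    out = embeddings.astype(float).copy()
    for lang in set(languages):
        mask = np.array([label == lang for label in languages])
        out[mask] -= embeddings[mask].mean(axis=0)
    return out


# Toy usage: four 3-dimensional embeddings from two languages.
emb = np.array([
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.2],
    [0.1, 0.9, 0.1],
])
langs = ["python", "python", "cpp", "cpp"]
agnostic = remove_language_component(emb, langs)
print(agnostic)
```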