CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation
- URL: http://arxiv.org/abs/2507.19904v1
- Date: Sat, 26 Jul 2025 10:28:39 GMT
- Title: CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation
- Authors: Zhanhang Xiong, Dongxia Wang, Yuekang Li, Xinyuan An, Wenhai Wang
- Abstract summary: We present CrossPL, the first benchmark designed to evaluate large language models' (LLMs) ability to generate cross-programming-language (CPL) code.
CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques.
We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation.
- Score: 24.468767564264738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill is essential for building complex systems that integrate components written in multiple languages via mechanisms like inter-process communication (IPC). To bridge this gap, we present CrossPL, the first benchmark designed to systematically evaluate LLMs' ability to generate CPL-interoperating code. CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques. We construct this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using 156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based pipeline that automatically extracts CPL code snippets, generates task instructions, and validates functional correctness. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation. Results reveal that even the best-performing models struggle with CPL scenarios, underscoring the need for more targeted research in this space. Our benchmark and code are available at: https://anonymous.4open.science/r/crosspl-2814.
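FSM-based detection and validation is the benchmark's central mechanism, so a toy illustration may help. The following is a minimal, hypothetical sketch of such a detector, not one of the paper's 156 hand-crafted FSMs: a finite state machine that flags a Python file as containing socket-based IPC client code. The states, events, and regex patterns are illustrative assumptions.

```python
import re

# Hypothetical sketch: a tiny finite state machine that flags a Python source
# file as a socket-based IPC client if it imports the socket module, connects,
# and then sends data, in that order. Illustrative only; the real CrossPL
# FSMs are hand-crafted per language and CPL technique.
TRANSITIONS = {
    ("start", "import_socket"): "imported",
    ("imported", "connect"): "connected",
    ("connected", "send"): "ipc_detected",
}
EVENTS = [
    ("import_socket", re.compile(r"^\s*import\s+socket\b")),
    ("connect", re.compile(r"\.connect\(")),
    ("send", re.compile(r"\.send(all)?\(")),
]

def detect_ipc(source: str) -> bool:
    """Run the FSM line by line; accept iff it reaches 'ipc_detected'."""
    state = "start"
    for line in source.splitlines():
        for event, pattern in EVENTS:
            if pattern.search(line) and (state, event) in TRANSITIONS:
                state = TRANSITIONS[(state, event)]
                break
    return state == "ipc_detected"

sample = 'import socket\ns = socket.socket()\ns.connect(("localhost", 9000))\ns.sendall(b"hi")\n'
print(detect_ipc(sample))  # True
```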
Related papers
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.
The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages [0.1906498126334485]
This paper evaluates the capabilities of the Llama 2-70B model in automating scientific applications written in multiple programming languages.
We assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between programming languages.
Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations.
arXiv Detail & Related papers (2025-03-24T23:46:14Z)
- Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis [14.458529723566379]
Large language models (LLMs) can be employed for programming languages such as Python and C++.
This paper explores leveraging LLMs to generate High-Level Synthesis (HLS)-based hardware design.
arXiv Detail & Related papers (2025-02-19T17:53:59Z)
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- InterTrans: Leveraging Transitive Intermediate Translations to Enhance LLM-based Code Translation [9.655135415596414]
Code translation aims to convert a program from one programming language (PL) to another.
Recent studies show that even advanced techniques such as large language models (LLMs) still struggle with the task.
We introduce InterTrans, an LLM-based automated code translation approach.
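As a rough illustration of the transitive idea, the sketch below enumerates the direct path and one-hop paths through intermediate languages, keeping the first candidate that validates. The `translate` and `passes_tests` functions are placeholders standing in for an LLM call and a test harness, not InterTrans's actual API.

```python
# Hypothetical sketch of transitive code translation: try the direct path
# first, then paths through one intermediate language, keeping the first
# candidate that passes the target-language tests.

def translate(code: str, src: str, dst: str) -> str:
    return f"// {src}->{dst} translation of:\n{code}"  # stand-in LLM call

def passes_tests(code: str, lang: str) -> bool:
    return True  # stand-in: compile and run the target test suite

def transitive_translate(code, src, dst, pool=("Java", "Go", "Python")):
    intermediates = [pl for pl in pool if pl not in (src, dst)]
    paths = [[src, dst]] + [[src, mid, dst] for mid in intermediates]
    for path in paths:
        candidate = code
        for a, b in zip(path, path[1:]):
            candidate = translate(candidate, a, b)
        if passes_tests(candidate, dst):
            return path, candidate
    return None, None

path, _ = transitive_translate("fn main() {}", "Rust", "Python")
print(path)  # ['Rust', 'Python']: the direct path succeeds first here
```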
arXiv Detail & Related papers (2024-11-01T22:31:32Z)
- Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
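A minimal sketch of what such line-by-line prompting could look like, assuming a placeholder `query_llm` call rather than the paper's exact prompt format:

```python
# Hypothetical sketch of Iterative Instruction Prompting (IIP) as described:
# rather than asking the model to "execute" a whole snippet in one shot, feed
# it one line at a time along with the state it reported so far.

def query_llm(prompt: str) -> str:
    return "{}"  # stand-in; a real call would return the model's reply

def execute_iteratively(code: str) -> str:
    state = "{}"  # model-maintained textual description of variable bindings
    for i, line in enumerate(code.splitlines(), start=1):
        prompt = (
            f"Current program state: {state}\n"
            f"Now execute line {i}: {line}\n"
            "Reply with only the updated program state."
        )
        state = query_llm(prompt)
    return state

print(execute_iteratively("x = 1\ny = x + 2"))
```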
arXiv Detail & Related papers (2024-10-09T08:23:22Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.1875460416205]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- The Struggles of LLMs in Cross-lingual Code Clone Detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction within the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
We evaluate the performance of five Large Language Models (LLMs) and eight prompts for the identification of cross-lingual code clones.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
- Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation [8.81447711370817]
We empirically analyze the generated outputs of eleven popular instruct-tuned large language models (LLMs).
Our results demonstrate that a strategic combination of prompt engineering and regular expression can effectively extract the source code from the model generation output.
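A minimal sketch of that extraction strategy, assuming the common markdown fence convention rather than any model-specific output format:

```python
import re

# Minimal sketch of prompt-plus-regex extraction: pull the first fenced code
# block out of a model's markdown-style reply, falling back to the raw text
# when no fence is present. The fence marker (three backquotes) is built from
# chr(96) so this sketch embeds cleanly in any document.
FENCE_MARK = chr(96) * 3
FENCE_RE = re.compile(FENCE_MARK + r"[\w+-]*\n(.*?)" + FENCE_MARK, re.DOTALL)

def extract_code(response: str) -> str:
    match = FENCE_RE.search(response)
    return match.group(1).strip() if match else response.strip()

reply = f"Sure, here you go:\n{FENCE_MARK}python\nprint('hi')\n{FENCE_MARK}\n"
print(extract_code(reply))  # print('hi')
```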
arXiv Detail & Related papers (2024-03-25T21:41:31Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code, masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
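A toy sketch of the FGO masking idea follows, with the executed-line set supplied directly rather than collected by a tracer; the function name and data layout are illustrative assumptions.

```python
# Hypothetical sketch of Fine-Grained Optimization (FGO): weight the loss over
# a generated program so that lines the tests never executed contribute
# nothing, optimizing the model only on code the runtime actually exercised.
# Coverage would come from a tracer such as coverage.py; here it is given.

def fgo_line_weights(lines, executed_lines):
    """Return 1.0 for executed lines and 0.0 for masked (unexecuted) ones."""
    return [1.0 if i in executed_lines else 0.0
            for i in range(1, len(lines) + 1)]

program = ["x = 5", "if x > 10:", "    x = 0", "print(x)"]
executed = {1, 2, 4}  # line 3 never ran under the test input, so mask it
print(fgo_line_weights(program, executed))  # [1.0, 1.0, 0.0, 1.0]
```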
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
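A toy sketch of verifier-based reranking in the spirit of LEVER, with a placeholder verifier standing in for the trained model:

```python
import math

# Hypothetical sketch of LEVER-style verification: sample several candidate
# programs, execute each, and rerank them by combining the generator's
# log-probability with a learned verifier's probability that the program is
# correct given its execution result.

def rerank(candidates, verifier):
    """candidates: list of (program, generator_log_prob, execution_result)."""
    best_score, best_program = float("-inf"), None
    for program, log_prob, result in candidates:
        p_correct = verifier(program, result)  # probability in [0, 1]
        score = log_prob + math.log(max(p_correct, 1e-9))
        if score > best_score:
            best_score, best_program = score, program
    return best_program

def toy_verifier(program, result):
    return 0.9 if result == "42" else 0.1  # placeholder for a trained model

candidates = [("print(6*7)", -1.2, "42"), ("print(6+7)", -0.8, "13")]
print(rerank(candidates, toy_verifier))  # print(6*7)
```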
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.