A New Benchmark for Evaluating Code Translation with Third-Party Libraries
- URL: http://arxiv.org/abs/2509.12087v1
- Date: Mon, 15 Sep 2025 16:16:14 GMT
- Title: A New Benchmark for Evaluating Code Translation with Third-Party Libraries
- Authors: Pengyu Xue, Kunwu Zheng, Zhen Yang, Yifei Pei, Linhao Wu, Jiahui Dong, Xiapu Luo, Yan Xiao, Fei Liu, Yuxuan Zhang, Xiran Lyu, Xianhang Li, Xuanyu Zhu, Chengyi Wang
- Abstract summary: TransLibEval is the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development. We evaluate seven recent LLMs of commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented.
- Score: 37.53966825335189
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In recent years, Large Language Models (LLMs) have been widely studied in the code translation field on the method, class, and even repository levels. However, most of these benchmarks are limited in terms of Third-Party Library (TPL) categories and scales, making TPL-related errors hard to expose and hindering the development of targeted solutions. Considering the high dependence (over 90%) on TPLs in practical programming, demystifying and analyzing LLMs' code translation performance involving various TPLs becomes imperative. To address this gap, we construct TransLibEval, the first benchmark dedicated to library-centric code translation. It consists of 200 real-world tasks across Python, Java, and C++, each explicitly involving TPLs from diverse categories such as data processing, machine learning, and web development, with comprehensive dependency coverage and high-coverage test suites. We evaluate seven recent LLMs of commercial, general, and code-specialized families under six translation strategies of three categories: Direct, IR-guided, and Retrieval-augmented. Experimental results show a dramatic performance drop compared with library-free settings (average CA decline over 60%), while diverse strategies demonstrate heterogeneous advantages. Furthermore, we analyze 4,831 failed cases from GPT-4o, one of the State-of-the-Art (SOTA) LLMs, revealing numerous third-party reference errors that were obscured previously. These findings highlight the unique challenges of library-centric translation and provide practical guidance for improving TPL-aware code intelligence.
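The abstract reports correctness as CA (computational accuracy) measured against high-coverage test suites. The harness below is a minimal, hypothetical sketch of that kind of test-suite-driven grading, not the authors' actual evaluation code: a translated candidate function passes a test case only if it returns the expected value without raising.

```python
# Hypothetical sketch of test-suite-based grading in the style of
# TransLibEval's CA metric (function and test names are illustrative).
def grade(candidate_fn, test_cases):
    """Return the fraction of test cases the candidate passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failures, like a crashed translation
    return passed / len(test_cases)

# Example: grading a correctly translated utility function.
def translated_mean(xs):
    return sum(xs) / len(xs)

tests = [(([1, 2, 3],), 2.0), (([4],), 4.0), (([0, 10],), 5.0)]
print(grade(translated_mean, tests))  # → 1.0
```

In the library-centric setting the paper studies, the candidate would call TPL APIs in the target language, so a single wrong API mapping can fail every test case at once, which is consistent with the sharp CA drop the authors report.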
Related papers
- LibContinual: A Comprehensive Library towards Realistic Continual Learning [62.34449396069085]
A fundamental challenge in Continual Learning (CL) is catastrophic forgetting, where adapting to new tasks degrades the performance on previous ones. We propose LibContinual, a comprehensive and reproducible library designed to serve as a foundational platform for realistic CL.
arXiv Detail & Related papers (2025-12-26T13:59:13Z) - Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation [3.9189409002585567]
Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks. We introduce a benchmark derived from real-world open-source repositories to evaluate generalization under practical conditions. We examine how input specification completeness and retrieval-augmented generation affect class-level correctness across multiple state-of-the-art LLMs.
arXiv Detail & Related papers (2025-10-30T04:30:23Z) - MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation [0.7342677574855649]
We introduce MRG-Bench, a novel dataset that provides a more accurate evaluation of large language models. We conduct experiments including large language models, long-context models, and RAG-related methods. Results show that the majority of methods suffer from "difficulty in understanding user requirements," failing to comprehend their assigned tasks accurately.
arXiv Detail & Related papers (2025-08-05T01:53:45Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - The Foundation Cracks: A Comprehensive Study on Bugs and Testing Practices in LLM Libraries [37.57398329330302]
Large Language Model (LLM) libraries have emerged as the foundational infrastructure powering today's AI revolution. Despite their critical role in the LLM ecosystem, these libraries face frequent quality issues and bugs that threaten the reliability of AI systems built upon them. We present the first comprehensive empirical investigation into bug characteristics and testing practices in modern LLM libraries.
arXiv Detail & Related papers (2025-06-14T03:00:36Z) - ClassEval-T: Evaluating Large Language Models in Class-Level Code Translation [19.69195067838796]
We construct a class-level code translation benchmark, ClassEval-T, and make the first attempt to extensively assess recent LLMs' performance on class-level code translation. It cost us 360 person-hours to accomplish the manual migration to Java and C++ with complete code samples and associated test suites. Experimental results demonstrate a remarkable performance drop compared with the most widely studied method-level code translation benchmark.
arXiv Detail & Related papers (2024-11-09T11:13:14Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter but more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
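The self-critique loop described above can be sketched as follows. This is a hedged, simplified illustration, not the paper's implementation: the LLM critic is stubbed out as a plain repair callback, and Python's built-in `compile()` stands in for the compiler feedback the method consumes.

```python
# Hedged sketch of an iterative critique-and-repair loop.
# In the actual method, fix_fn would be an LLM critiquing its own
# output given the bug type and compiler diagnostics.
def check(code):
    """Return a compiler diagnostic, or None if the code compiles."""
    try:
        compile(code, "<candidate>", "exec")
        return None
    except SyntaxError as e:
        return str(e)

def repair_loop(code, fix_fn, max_rounds=3):
    """Iteratively feed diagnostics back into the repairer."""
    for _ in range(max_rounds):
        err = check(code)
        if err is None:
            return code  # accepted: no remaining diagnostics
        code = fix_fn(code, err)
    return code

broken = "def add(a, b) return a + b"  # missing colon
fixed = repair_loop(broken, lambda c, e: c.replace(") return", "): return"))
print(check(fixed) is None)  # → True
```

The design point the loop captures is that the method is training-free: all improvement comes from re-prompting with feedback at inference time, so the same loop applies to any model that can consume diagnostics.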
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more, which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Analyzing and Adapting Large Language Models for Few-Shot Multilingual
NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z) - AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.