Related papers: RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation

URL: http://arxiv.org/abs/2412.17744v1
Date: Mon, 23 Dec 2024 17:52:10 GMT
Title: RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation
Authors: Yanli Wang, Yanlin Wang, Suiquan Wang, Daya Guo, Jiachi Chen, John Grundy, Xilin Liu, Yuchi Ma, Mingzhi Mao, Hongyu Zhang, Zibin Zheng
Abstract summary: Repository-level code translation refers to translating an entire code repository from one programming language to another. Previous benchmarks mostly provide fine-grained samples, focusing on snippet-level, function-level, or file-level code translation. We propose RepoTransBench, a real-world repository-level code translation benchmark with an automatically executable test suite.
Score: 44.856816446807265
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Repository-level code translation refers to translating an entire code repository from one programming language to another while preserving the functionality of the source repository. Many benchmarks have been proposed to evaluate the performance of such code translators. However, previous benchmarks mostly provide fine-grained samples, focusing on snippet-level, function-level, or file-level code translation. Such benchmarks do not accurately reflect real-world demands, where entire repositories often need to be translated, involving longer code and more complex functionality. To address this gap, we propose a new benchmark, named RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite. We conduct experiments on RepoTransBench to evaluate the translation performance of 11 advanced LLMs. We find that the Success@1 score (test success in one attempt) of the best-performing LLM is only 7.33%. To further explore the potential of LLMs for repository-level code translation, we provide LLMs with error-related feedback to perform iterative debugging and observe an average 7.09% improvement on Success@1. However, even with this improvement, the Success@1 score of the best-performing LLM is only 21%, which may not meet the need for reliable automatic repository-level code translation. Finally, we conduct a detailed error analysis and highlight current LLMs' deficiencies in repository-level code translation, which could provide a reference for further improvements.
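
The abstract describes Success@1 as test success in one attempt and mentions iterative debugging driven by error-related feedback. The following is a minimal Python sketch of how such a score and feedback loop might be computed. It is not the benchmark's actual harness: the generalization to Success@k, the function names `success_at_k` and `translate_with_feedback`, and the `translate` and `run_tests` callables are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


def success_at_k(first_passing_attempt: Sequence[Optional[int]], k: int) -> float:
    """Percentage of repositories whose test suite passed within k attempts.

    Each entry is the 1-based attempt on which a repository first passed
    its tests, or None if it never passed (assumed bookkeeping format).
    """
    if not first_passing_attempt:
        raise ValueError("need at least one repository result")
    solved = sum(1 for a in first_passing_attempt if a is not None and a <= k)
    return 100.0 * solved / len(first_passing_attempt)


def translate_with_feedback(
    translate: Callable[[Optional[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    max_rounds: int,
) -> Optional[int]:
    """Retry a translation, feeding test errors back into the next attempt.

    `translate` and `run_tests` are hypothetical stand-ins for an LLM call
    and a test runner. Returns the 1-based attempt that passed, or None.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_rounds + 1):
        candidate = translate(feedback)
        passed, feedback = run_tests(candidate)
        if passed:
            return attempt
    return None
```

With per-repository results such as `[1, None, 2, None]`, `success_at_k(results, 1)` counts only the repository that passed on its first attempt, while a larger `k` also credits repositories fixed in later debugging rounds.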

Related papers

SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories [8.39619253014789]
SecRepoBench is a benchmark to evaluate LLMs on secure code generation in real-world repositories. We evaluate 19 state-of-the-art LLMs using our benchmark and find that the models struggle with generating correct and secure code.
arXiv Detail & Related papers (2025-04-29T22:22:44Z)
Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented [25.812942624520694]
Large language models (LLMs) perform well in function-level code translation without repository-level context. We propose K-Trans, which leverages triple knowledge augmentation to enhance LLMs' translation quality under repository context. Experiments show that K-Trans substantially outperforms the baseline adapted from previous work, with relative improvements of 19.4%/40.2% in pass@1 and an improvement of 0.138 in CodeBLEU.
arXiv Detail & Related papers (2025-03-24T03:10:34Z)
ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation [37.34003516231121]
Code translation is a crucial activity in the software development and maintenance process. Existing large language models (LLMs) only learn the contextual semantics of code during pre-training. We propose ExeCoder, an LLM specifically designed for code translation.
arXiv Detail & Related papers (2025-01-30T16:18:52Z)
Repository-level Code Translation Benchmark Targeting Rust [28.25765853736366]
We introduce the first repository-level code translation benchmark, comprising 375 tasks targeting Rust. Using this benchmark, we study four state-of-the-art large language models (LLMs). Our findings reveal that LLMs exhibit substantially worse performance (41.5%-56.2% Pass@1 drop of GPT-4) on repository-level translations compared to simpler tasks.
arXiv Detail & Related papers (2024-11-21T10:00:52Z)
Towards Translating Real-World Code with LLMs: A Study of Translating to Rust [13.743967357458287]
Large language models (LLMs) show promise in code translation due to their ability to write code in most programming languages. We conduct our study on code extracted from real-world open source projects. FLOURINE is an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program.
arXiv Detail & Related papers (2024-05-19T10:54:03Z)
Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning [57.323716555996114]
Off-target translation remains an unsolved problem, especially for low-resource languages. Recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs. In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs.
arXiv Detail & Related papers (2024-03-21T13:47:40Z)
AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment. We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge approach with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process. It incorporates a similarity-based retriever and a pre-trained code language model. It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.