Repository-level Code Translation Benchmark Targeting Rust
- URL: http://arxiv.org/abs/2411.13990v3
- Date: Tue, 26 Nov 2024 13:21:44 GMT
- Title: Repository-level Code Translation Benchmark Targeting Rust
- Authors: Guangsheng Ou, Mingwei Liu, Yuxuan Chen, Xin Peng, Zibin Zheng,
- Abstract summary: We introduce first repository-level code translation benchmark comprising 375 tasks targeting Rust.
Using this benchmark, we study four state-of-the-art large language models (LLMs)
Our findings reveal that LLMs exhibit substantially worse performance (41.5%-56.2% Pass@1 drop of GPT-4) on repository-level translations compared to simpler tasks.
- Score: 28.25765853736366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large language models (LLMs) have shown significant capabilities in code translation, often evaluated using benchmarks like CodeTransOcean. However, these evaluations typically focus on simple, function-level translations without considering dependencies, which does not reflect the complexities of real-world software development. Further, their effectiveness in translating to newer, lower-resource languages like Rust in realistic scenarios is still under-explored. To address this gap, we introduce first repository-level code translation benchmark comprising 375 tasks targeting Rust, complete with relevant dependencies. Using this benchmark, we study four state-of-the-art LLMs, analyzing their erroneous outputs to understand their performance in more complex translation scenarios. Our findings reveal that LLMs exhibit substantially worse performance (41.5%-56.2% Pass@1 drop of GPT-4) on repository-level translations compared to simpler tasks, highlighting limitations in existing evaluation methods. The model that performed the best is Claude-3.5, demonstrating the strongest translation capabilities in both basic functionality accuracy and several relevant additional abilities. Additionally, we discover that LLMs struggle with identifying language differences in complex tasks, and that increased dependencies correlate with greater translation difficulty.
Related papers
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels.
Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z) - LLMigrate: Transforming "Lazy" Large Language Models into Efficient Source Code Migrators [21.114491141763647]
Rewriting C code in Rust provides stronger memory safety, yet migrating larges such as the 32-million-line Linux kernel remains challenging.
Recent Large Language Model (LLM) approaches produce more idiomatic, safe Rust programs but frequently exhibit "laziness"
LLM-based C-to-Rust translation tool splits modules into discrete functions, translating them individually, and then reintegrating them.
arXiv Detail & Related papers (2025-03-31T07:09:07Z) - Enhancing LLM-based Code Translation in Repository Context via Triple Knowledge-Augmented [25.812942624520694]
Large language models (LLMs) have behaved well in function-level code translation without repository-level context.
We propose K-Trans, which leverages triple knowledge augmentation to enhance LLM's translation quality under repository context.
Experiments show that K-Trans substantially outperforms the baseline adapted from previous work by 19.4%/40.2% relative improvement in pass@1 and 0.138 in CodeBLEU.
arXiv Detail & Related papers (2025-03-24T03:10:34Z) - LLM-Driven Multi-step Translation from C to Rust using Static Analysis [27.122409727034192]
Translating software written in legacy languages to modern languages, such as C to Rust, has significant benefits in improving memory safety.
We propose SACTOR, an LLM-driven C-to-Rust zero-shot translation tool using a two-step translation methodology.
SACTOR produces more natural and Rust-compliant translations compared to existing methods.
arXiv Detail & Related papers (2025-03-16T14:05:26Z) - Lost in Literalism: How Supervised Training Shapes Translationese in LLMs [51.04435855143767]
Large language models (LLMs) have achieved remarkable success in machine translation.
However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge.
We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances.
arXiv Detail & Related papers (2025-03-06T12:14:45Z) - RepoTransBench: A Real-World Benchmark for Repository-Level Code Translation [44.856816446807265]
Repository-level code translation refers to translating an entire code repository from one programming language to another.
Previous benchmarks mostly provide fine-grained samples, focusing at either code snippet, function, or file-level code translation.
We propose RepoTransBench, which is a real-world repository-level code translation benchmark with an automatically executable test suite.
arXiv Detail & Related papers (2024-12-23T17:52:10Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization ( CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z) - AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark [3.1927733045184885]
AAVENUE is a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English.
We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability.
Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases.
arXiv Detail & Related papers (2024-08-27T07:56:35Z) - TasTe: Teaching Large Language Models to Translate through Self-Reflection [82.83958470745381]
Large language models (LLMs) have exhibited remarkable performance in various natural language processing tasks.
We propose the TasTe framework, which stands for translating through self-reflection.
The evaluation results in four language directions on the WMT22 benchmark reveal the effectiveness of our approach compared to existing methods.
arXiv Detail & Related papers (2024-06-12T17:21:21Z) - Towards Translating Real-World Code with LLMs: A Study of Translating to Rust [13.743967357458287]
Large language models (LLMs) show promise in code translation due to their ability to write code in most programming languages.
We conduct our study on code extracted from real-world open source projects.
FLOURINE is an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program.
arXiv Detail & Related papers (2024-05-19T10:54:03Z) - Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning [57.323716555996114]
Off-target translation remains an unsolved problem, especially for low-resource languages.
Recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs.
In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs.
arXiv Detail & Related papers (2024-03-21T13:47:40Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement [26.26493253161022]
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT)
We introduce a systematic LLM-based self-refinement translation framework, named textbfTEaR.
arXiv Detail & Related papers (2024-02-26T07:58:12Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis [103.89753784762445]
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT)
This paper systematically investigates the advantages and challenges of LLMs for MMT.
We thoroughly evaluate eight popular LLMs, including ChatGPT and GPT-4.
arXiv Detail & Related papers (2023-04-10T15:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.