Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models
- URL: http://arxiv.org/abs/2409.10506v1
- Date: Mon, 16 Sep 2024 17:52:36 GMT
- Title: Context-aware Code Segmentation for C-to-Rust Translation using Large Language Models
- Authors: Momoko Shiraishi, Takahiro Shinagawa
- Abstract summary: Large language models (LLMs) show promise for automating this translation by generating more natural and safer code than rule-based methods.
We propose an LLM-based translation scheme that improves the success rate of translating large-scale C code into compilable Rust code.
In experiments with 20 benchmark C programs, including those exceeding 4,000 lines of code, we successfully translated all programs into compilable Rust code.
- Score: 1.8416014644193066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is strong motivation to translate C code into Rust code due to the continuing threat of memory safety vulnerabilities in existing C programs and the significant attention paid to Rust as an alternative to the C language. While large language models (LLMs) show promise for automating this translation by generating more natural and safer code than rule-based methods, previous studies have shown that LLM-generated Rust code often fails to compile, even for relatively small C programs, due to significant differences between the two languages and context window limitations. We propose an LLM-based translation scheme that improves the success rate of translating large-scale C code into compilable Rust code. Our approach involves three key techniques: (1) pre-processing the C code to better align its structure and expressions with Rust, (2) segmenting the code into optimally sized translation units to avoid exceeding the LLM's context window limits, and (3) iteratively compiling and repairing errors while maintaining consistency between translation units using context-supplementing prompts. Compilation success is an essential first step in achieving functional equivalence, as only compilable code can be further tested. In experiments with 20 benchmark C programs, including those exceeding 4,000 lines of code, we successfully translated all programs into compilable Rust code without losing corresponding parts of the original code.
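The third technique, the compile-and-repair loop, is straightforward to prototype. Below is a minimal Python sketch of such a loop, assuming `rustc` is on PATH; `llm_repair` is a hypothetical placeholder for an actual LLM call, not the paper's implementation.

```python
# Minimal sketch of an iterative compile-and-repair loop in the spirit of
# the paper's third technique. `llm_repair` is a hypothetical stand-in for
# a real LLM call; everything else is standard library.
import subprocess
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 5

def llm_repair(rust_code: str, errors: str, context: str) -> str:
    """Hypothetical LLM call: return repaired Rust code given compiler
    errors and context-supplementing information (e.g., declarations
    from other translation units)."""
    raise NotImplementedError("wire up your LLM client here")

def try_compile(rust_code: str) -> str | None:
    """Compile with rustc; return the error output, or None on success."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "unit.rs"
        src.write_text(rust_code)
        result = subprocess.run(
            ["rustc", "--edition", "2021", "--crate-type", "lib", str(src)],
            capture_output=True, text=True, cwd=tmp,
        )
        return None if result.returncode == 0 else result.stderr

def compile_repair_loop(rust_code: str, context: str) -> str:
    """Iteratively compile and repair one translation unit."""
    for _ in range(MAX_ATTEMPTS):
        errors = try_compile(rust_code)
        if errors is None:
            return rust_code  # compilable: ready for functional testing
        rust_code = llm_repair(rust_code, errors, context)
    raise RuntimeError("unit still fails to compile after repairs")
```

In the paper's scheme, the context passed alongside the errors supplements the prompt so that repairs stay consistent across translation units.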
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
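As a rough intuition for Lexical Unit Decoding, the toy below decodes several tokens per step and accepts the longest prefix whose confidence clears a threshold. The real method identifies decodable spans in a data-driven way; `propose` here is only a random stub, and the threshold rule is an assumption for illustration.

```python
# Toy illustration of decoding multi-token "lexical units" in one step
# instead of one token at a time. A per-token confidence threshold
# stands in for the paper's learned acceptance criterion.
import random

THRESHOLD = 0.8

def propose(prefix: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Stub model: propose k next tokens with confidences in parallel."""
    vocab = ["let", "mut", "x", "=", "0", ";"]
    return [(random.choice(vocab), random.random()) for _ in range(k)]

def decode(prefix: list[str], steps: int = 5) -> list[str]:
    out = list(prefix)
    for _ in range(steps):
        unit = propose(out)
        # Accept the longest prefix of the proposal whose confidence
        # clears the threshold; fall back to one token otherwise.
        accepted = []
        for token, conf in unit:
            if conf < THRESHOLD:
                break
            accepted.append(token)
        out.extend(accepted or [unit[0][0]])
    return out

print(decode(["fn", "main", "(", ")", "{"]))
```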
- VERT: Verified Equivalent Rust Transpilation with Large Language Models as Few-Shot Learners [6.824327908701066]
Rust is a programming language that combines memory safety and low-level control, providing C-like performance.
Existing work falls into two categories: rule-based and large language model (LLM)-based.
We present VERT, a tool that can produce readable Rust transpilations with formal guarantees of correctness.
arXiv Detail & Related papers (2024-04-29T16:45:03Z)
- Enabling Memory Safety of C Programs using LLMs [5.297072277460838]
Memory safety violations in low-level code, written in languages like C, continue to remain one of the major sources of software vulnerabilities.
One method of removing such violations by construction is to port C code to a safe C dialect.
Such dialects rely on programmer-supplied annotations to guarantee safety with minimal runtime overhead.
This porting is a manual process that imposes significant burden on the programmer and hence, there has been limited adoption of this technique.
arXiv Detail & Related papers (2024-04-01T13:05:54Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
- Fixing Rust Compilation Errors using LLMs [2.1781086368581932]
The Rust programming language has established itself as a viable choice for low-level systems programming over traditional, unsafe alternatives like C/C++.
This paper presents a tool called RustAssistant that leverages the emergent capabilities of Large Language Models (LLMs) to automatically suggest fixes for Rust compilation errors.
RustAssistant is able to achieve an impressive peak accuracy of roughly 74% on real-world compilation errors in popular open-source Rust repositories.
arXiv Detail & Related papers (2023-08-09T18:30:27Z)
- Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z)
- CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context [82.88371379927112]
We propose a framework that incorporates cross-file context to learn the in-file and cross-file context jointly on top of pretrained code LMs.
CoCoMIC successfully improves the existing code LM with a 33.94% relative increase in exact match and a 28.69% relative increase in identifier matching for code completion when the cross-file context is provided.
arXiv Detail & Related papers (2022-12-20T05:48:09Z)
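A hedged sketch of the idea behind CoCoMIC: completion prompts combine the in-file prefix with retrieved cross-file context. The retriever below is a naive keyword scan for illustration only; the paper builds a dedicated cross-file context finder over the project structure.

```python
# Sketch of assembling in-file plus cross-file context for completion.
# The retrieval step here is naive substring matching, not the paper's
# cross-file context finder.
from pathlib import Path

def retrieve_cross_file(project: Path, query: str, limit: int = 3) -> list[str]:
    """Naive stand-in retriever: collect snippets from project files
    that mention an identifier from the completion site."""
    hits = []
    for f in project.rglob("*.rs"):
        text = f.read_text(errors="ignore")
        if query in text:
            hits.append(f"// from {f.name}\n{text[:300]}")
        if len(hits) >= limit:
            break
    return hits

def build_prompt(in_file: str, cross_file: list[str]) -> str:
    # Cross-file snippets first, then the in-file prefix to complete.
    return "\n\n".join(cross_file + [in_file])
```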
- Code Translation with Compiler Representations [21.702473137941006]
Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code.
Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation.
Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages.
arXiv Detail & Related papers (2022-06-30T14:21:57Z)
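A small sketch of the IR-augmentation idea: emit LLVM IR for the source with clang and include it alongside the source in the translation prompt. The prompt format is illustrative, not taken from the paper; clang is assumed to be installed.

```python
# Sketch of augmenting a translation prompt with LLVM IR.
import subprocess

def llvm_ir(c_source_path: str) -> str:
    """Emit (unoptimized) LLVM IR for a C file via clang, to stdout."""
    result = subprocess.run(
        ["clang", "-S", "-emit-llvm", "-o", "-", c_source_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def translation_prompt(c_code: str, ir: str) -> str:
    # Illustrative prompt layout: source first, IR as disambiguation.
    return (
        "Translate the following C code to Rust.\n"
        f"C source:\n{c_code}\n"
        f"LLVM IR (for disambiguating types and semantics):\n{ir}\n"
    )
```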
- Natural Language to Code Translation with Execution [82.52142893010563]
This work introduces execution result-based minimum Bayes risk decoding (MBR-EXEC) for program selection.
We show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks.
arXiv Detail & Related papers (2022-04-25T06:06:08Z)
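MBR-EXEC is compact enough to sketch directly: sample several candidate programs, execute them on shared inputs, and return the candidate whose execution behavior agrees with the most other samples. The sketch below uses exact output match as the agreement metric and an unsandboxed subprocess for execution, both simplifying assumptions.

```python
# Minimal sketch of execution-based minimum Bayes risk selection:
# pick the candidate whose outputs agree with the most other candidates.
import subprocess
from collections import Counter

def run(program: str, test_input: str) -> str:
    """Execute a candidate Python program with the given stdin.
    A real deployment would sandbox this."""
    try:
        result = subprocess.run(
            ["python", "-c", program], input=test_input,
            capture_output=True, text=True, timeout=5,
        )
        return result.stdout
    except subprocess.TimeoutExpired:
        return "<timeout>"

def mbr_exec(candidates: list[str], test_inputs: list[str]) -> str:
    # Fingerprint each candidate by its outputs on all test inputs.
    prints = [tuple(run(c, t) for t in test_inputs) for c in candidates]
    counts = Counter(prints)
    # Select the candidate whose behavior is most common among samples.
    best = max(range(len(candidates)), key=lambda i: counts[prints[i]])
    return candidates[best]
```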
- Synchromesh: Reliable code generation from pre-trained language models [38.15391794443022]
We propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation.
First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection.
Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD), a general framework for constraining the output to a set of valid programs in the target language.
arXiv Detail & Related papers (2022-01-26T22:57:44Z)
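To make Constrained Semantic Decoding concrete, here is a toy in which the set of valid programs is given explicitly and decoding masks any character that cannot extend to a valid program. Synchromesh instead derives such completion engines from the target language's actual constraints; the `score` function stands in for LM probabilities.

```python
# Toy sketch of constrained decoding: at each step, only characters
# that keep the partial program extensible to a valid one are allowed.
VALID_PROGRAMS = ["let x = 1;", "let x = 2;", "let y = 1;"]

def is_valid_prefix(prefix: str) -> bool:
    return any(p.startswith(prefix) for p in VALID_PROGRAMS)

def constrained_decode(score, max_len: int = 20) -> str:
    """Greedy decode over characters, masking invalid continuations.
    `score(prefix, ch)` is a stand-in for LM token probabilities."""
    out = ""
    while len(out) < max_len and out not in VALID_PROGRAMS:
        allowed = sorted(ch for ch in set("".join(VALID_PROGRAMS))
                         if is_valid_prefix(out + ch))
        if not allowed:
            break
        out += max(allowed, key=lambda ch: score(out, ch))
    return out

# Example with a uniform "model": decoding still yields a valid program.
print(constrained_decode(lambda prefix, ch: 0.0))
```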
This list is automatically generated from the titles and abstracts of the papers on this site.