Related papers: Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System

Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System

URL: http://arxiv.org/abs/2512.02567v1
Date: Tue, 02 Dec 2025 09:38:20 GMT
Title: Feedback Loops and Code Perturbations in LLM-based Software Engineering: A Case Study on a C-to-Rust Translation System
Authors: Martin Weiss, Jesko Hecking-Harbusch, Jochen Quante, Matthias Woehrle,
Abstract summary: We study the effect of three variables on an automated C-to-Rust translation system.<n>Our results show that without feedback loops LLM selection has a large effect on translation success.<n>We also identify that diversity provided by code perturbations can even result in improved system performance.
Score: 1.2566563622834341
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The advent of strong generative AI has a considerable impact on various software engineering tasks such as code repair, test generation, or language translation. While tools like GitHub Copilot are already in widespread use in interactive settings, automated approaches require a higher level of reliability before being usable in industrial practice. In this paper, we focus on three aspects that directly influence the quality of the results: a) the effect of automated feedback loops, b) the choice of Large Language Model (LLM), and c) the influence of behavior-preserving code changes. We study the effect of these three variables on an automated C-to-Rust translation system. Code translation from C to Rust is an attractive use case in industry due to Rust's safety guarantees. The translation system is based on a generate-and-check pattern, in which Rust code generated by the LLM is automatically checked for compilability and behavioral equivalence with the original C code. For negative checking results, the LLM is re-prompted in a feedback loop to repair its output. These checks also allow us to evaluate and compare the respective success rates of the translation system when varying the three variables. Our results show that without feedback loops LLM selection has a large effect on translation success. However, when the translation system uses feedback loops the differences across models diminish. We observe this for the average performance of the system as well as its robustness under code perturbations. Finally, we also identify that diversity provided by code perturbations can even result in improved system performance.

Related papers

From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models [5.31828955342405]
Large language models (LLMs) are increasingly used for automated code tasks.<n>This article systematically study the capabilities of LLMs for code readability.
arXiv Detail & Related papers (2026-02-25T12:05:25Z)
Code Quality Analysis of Translations from C to Rust [10.42011095909225]
C/C++ is a prevalent programming language. Yet, it suffers from significant memory and thread-safety issues.<n>Recent studies have explored automated translation of C/C++ to safer languages, such as Rust.<n>This work investigates strengths and weaknesses of three C-to-Rust translators, namely C2Rust (a transpiler), C2SaferRust (an LLM-guided transpiler), and TranslationGym.
arXiv Detail & Related papers (2026-01-31T18:12:03Z)
Protocode: Prototype-Driven Interpretability for Code Generation in LLMs [5.8296917468117835]
Large Language Models (LLMs) have been widely adopted for various tasks such as text summarization, question answering, speech-to-text translation, and more.<n>Our work focuses on automatically sampling In-Context Learning (ICL) demonstrations which can improve model performance and enhance the interpretability of the generated code.
arXiv Detail & Related papers (2025-09-27T00:32:45Z)
Function-to-Style Guidance of LLMs for Code Translation [59.487054943812836]
We propose F2STrans, a function-to-style guiding paradigm designed to improve the performance of large language models in code translation.<n>Our approach comprises two key stages: (1) Functional learning, which optimize translation correctness using high-quality source-target code pairs.<n>We introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations.
arXiv Detail & Related papers (2025-07-15T08:25:02Z)
SafeTrans: LLM-assisted Transpilation from C to Rust [5.6274106543826585]
Rust is a strong contender for a memory-safe alternative to C as a "systems" programming language.<n>In this paper, we evaluate the potential of large language models (LLMs) to automate the transpilation of C code to Rust.<n>We present the design and implementation of SafeTrans, a framework that uses LLMs to i) transpile C code into Rust and ii) iteratively fix any compilation and runtime errors.
arXiv Detail & Related papers (2025-05-15T21:05:33Z)
Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4 effectiveness in recommending and suggesting idiomatic actions.<n>Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis [8.361424157571468]
Syzygy is an automated approach to translate C to safe Rust.<n>This is the largest automated and test-validated C to safe Rust code translation achieved so far.
arXiv Detail & Related papers (2024-12-18T18:55:46Z)
CYCLE: Learning to Self-Refine the Code Generation [19.71833229434497]
We propose CYCLE framework, learning to self-refine the faulty generation according to the available feedback. We implement four variants of CYCLE with varied numbers of parameters across 350M, 1B, 2B, and 3B benchmarks. The results reveal that CYCLE successfully maintains, sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs.
arXiv Detail & Related papers (2024-03-27T16:45:02Z)
Feedback Loops With Language Models Drive In-Context Reward Hacking [78.9830398771605]
We show that feedback loops can cause in-context reward hacking (ICRH) We identify and study two processes that lead to ICRH: output-refinement and policy-refinement. As AI development accelerates, the effects of feedback loops will proliferate.
arXiv Detail & Related papers (2024-02-09T18:59:29Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback. Our focus is the code generation task, where the model produces code based on natural language instructions. LETI iteratively fine-tunes the model, using the objective LM, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.