Related papers: From LLMs to Agents in Programming: The Impact of Providing an LLM with a Compiler

From LLMs to Agents in Programming: The Impact of Providing an LLM with a Compiler

URL: http://arxiv.org/abs/2601.12146v2
Date: Fri, 23 Jan 2026 08:51:50 GMT
Title: From LLMs to Agents in Programming: The Impact of Providing an LLM with a Compiler
Authors: Viktor Kjellberg, Miroslaw Staron, Farnaz Fotrousi,
Abstract summary: Large Language Models have demonstrated a remarkable capability in natural language and program generation and software development.<n>This paper studies the degree to which such agents benefit from access to software development tools, in our case, a gcc compiler.<n>We evaluate how the integration with a compiler shifts the role of the language model from a passive generator to an active agent capable of iteratively developing runnable programs based on feedback from the compiler.
Score: 2.7400724993677703
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models have demonstrated a remarkable capability in natural language and program generation and software development. However, the source code generated by the LLMs does not always meet quality requirements and may fail to compile. Therefore, many studies evolve into agents that can reason about the problem before generating the source code for the solution. The goal of this paper is to study the degree to which such agents benefit from access to software development tools, in our case, a gcc compiler. We conduct a computational experiment on the RosettaCode dataset, on 699 programming tasks in C. We evaluate how the integration with a compiler shifts the role of the language model from a passive generator to an active agent capable of iteratively developing runnable programs based on feedback from the compiler. We evaluated 16 language models with sizes ranging from small (135 million) to medium (3 billion) and large (70 billion). Our results show that access to a compiler improved the compilation success by 5.3 to 79.4 percentage units in compilation without affecting the semantics of the generated program. Syntax errors dropped by 75%, and errors related to undefined references dropped by 87% for the tasks where the agents outperformed the baselines. We also observed that in some cases, smaller models with a compiler outperform larger models with a compiler. We conclude that it is essential for LLMs to have access to software engineering tools to enhance their performance and reduce the need for large models in software engineering, such as reducing our energy footprint.

Related papers

AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents [3.0684671771686394]
This paper presents an empirical study investigating whether Large Language Models (LLMs) can be executed successfully in a clean environment with only OS packages.<n>We evaluate three state-of-the-art LLM coding agents across 300 projects generated from 100 standardized prompts in Python, JavaScript, and Java.<n>Our results show that only 68.3% of projects execute out-of-the-box, with substantial variation across languages.
arXiv Detail & Related papers (2025-12-26T21:17:22Z)
Exploring the Feasibility of End-to-End Large Language Model as a Compiler [20.15972226865971]
Large Language Model (LLM) technology has shown substantial advantages across various domains.<n>This paper explores the feasibility of LLM as a Compiler (LaaC) and its future directions.
arXiv Detail & Related papers (2025-11-06T07:21:42Z)
Context-Guided Decompilation: A Step Towards Re-executability [50.71992919223209]
Binary decompilation plays an important role in software security analysis, reverse engineering and malware understanding.<n>Recent advances in large language models (LLMs) have enabled neural decompilation, but the generated code is typically only semantically plausible.<n>We propose ICL4Decomp, a hybrid decompilation framework that leverages in-context learning (ICL) to guide LLMs toward generating re-executable source code.
arXiv Detail & Related papers (2025-11-03T17:21:39Z)
CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System [52.048087777953064]
We propose CompileAgent, an agent framework dedicated to repo-level compilation.<n>CompileAgent integrates five tools and a flow-based agent strategy, enabling interaction with software artifacts for compilation instruction search and error resolution.<n>We show that our method significantly improves the compilation success rate, ranging from 10% to 71%.
arXiv Detail & Related papers (2025-05-07T08:59:14Z)
LLM Agents Making Agent Tools [2.5529148902034637]
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks.<n>But these tools must be implemented in advance by human developers.<n>We propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools.
arXiv Detail & Related papers (2025-02-17T11:44:11Z)
Meta Large Language Model Compiler: Foundation Models of Compiler Optimization [21.161784011956126]
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. We introduce Meta Large Language Model Compiler (LLM Compiler), a suite of robust, openly available, pre-trained models for code optimization tasks.
arXiv Detail & Related papers (2024-06-27T21:47:48Z)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks.<n>Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.<n>We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce emphsynthetic programming elicitation and compilation (SPEAC) SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness. We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z)
Output Format Biases in the Evaluation of Large Language Models for Code Translation [6.75681623173699]
It is crucial to understand and address variations in output format.<n>Non-code elements can interfere with evaluation metrics, resulting in biased assessments of model performance and comparisons.<n>We propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format outputs.
arXiv Detail & Related papers (2024-03-25T21:41:31Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code) Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
Dcc --help: Generating Context-Aware Compiler Error Explanations with Large Language Models [53.04357141450459]
dcc --help was deployed to our CS1 and CS2 courses, with 2,565 students using the tool over 64,000 times in ten weeks. We found that the LLM-generated explanations were conceptually accurate in 90% of compile-time and 75% of run-time cases, but often disregarded the instruction not to provide solutions in code.
arXiv Detail & Related papers (2023-08-23T02:36:19Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
Beyond the C: Retargetable Decompilation using Neural Machine Translation [5.734661402742406]
We develop a prototype decompiler that is easily retargetable to new languages. We examine the impact of parameters such as tokenization and training data selection on the quality of decompilation. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
arXiv Detail & Related papers (2022-12-17T20:45:59Z)
Compilable Neural Code Generation with Compiler Feedback [43.97362484564799]
This paper proposes a three-stage pipeline for compilable code generation, including language model fine-tuning, compilability reinforcement, and compilability discrimination. Experiments on two code generation tasks demonstrate the effectiveness of our proposed approach, improving the success rate of compilation from 44.18 to 89.18 on average and from 70.3 to 96.2 in text-to-code generation, respectively.
arXiv Detail & Related papers (2022-03-10T03:15:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.