Fully Autonomous Programming with Large Language Models
- URL: http://arxiv.org/abs/2304.10423v1
- Date: Thu, 20 Apr 2023 16:12:05 GMT
- Title: Fully Autonomous Programming with Large Language Models
- Authors: Vadim Liventsev and Anastasiia Grishina and Aki Härmä and Leon Moonen
- Abstract summary: Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome".
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
- Score: 0.9558392439655015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current approaches to program synthesis with Large Language Models (LLMs)
exhibit a "near miss syndrome": they tend to generate programs that
semantically resemble the correct answer (as measured by text similarity
metrics or human evaluation), but achieve a low or even zero accuracy as
measured by unit tests due to small imperfections, such as the wrong input or
output format. This calls for an approach known as Synthesize, Execute, Debug
(SED), whereby a draft of the solution is generated first, followed by a
program repair phase addressing the failed tests. To effectively apply this
approach to instruction-driven LLMs, one needs to determine which prompts
perform best as instructions for LLMs, as well as strike a balance between
repairing unsuccessful programs and replacing them with newly generated ones.
We explore these trade-offs empirically, comparing replace-focused,
repair-focused, and hybrid debug strategies, as well as different
template-based and model-based prompt-generation techniques. We use OpenAI
Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem
descriptions and tests for evaluation. The resulting framework outperforms both
conventional usage of Codex without the repair phase and traditional genetic
programming approaches.
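To make the SED loop concrete, here is a minimal Python sketch of the Synthesize, Execute, Debug cycle with a hybrid repair/replace budget. The `generate` and `repair` callables stand in for prompts to an instruction-driven LLM such as Codex; they, the budget values, and the test format are illustrative assumptions, not the authors' exact framework.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Test = Tuple[str, str]  # (stdin, expected stdout)

def run_tests(program: str, tests: List[Test]) -> List[Test]:
    """Execute the candidate program against each test; return the failing ones."""
    failed = []
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                ["python3", "-c", program],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
            ok = result.returncode == 0 and result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            ok = False
        if not ok:
            failed.append((stdin, expected))
    return failed

def sed(description: str, tests: List[Test],
        generate: Callable[[str], str],
        repair: Callable[[str, List[Test]], str],
        max_iterations: int = 10, repair_budget: int = 3) -> Optional[str]:
    program = generate(description)            # Synthesize a first draft
    repairs_left = repair_budget
    for _ in range(max_iterations):
        failing = run_tests(program, tests)    # Execute against unit tests
        if not failing:
            return program                     # all tests pass
        if repairs_left > 0:                   # Debug: repair the draft...
            program = repair(program, failing)
            repairs_left -= 1
        else:                                  # ...or replace it outright
            program = generate(description)
            repairs_left = repair_budget
    return None
```

Exhausting a small repair budget before regenerating from scratch mirrors the balance the abstract describes between repairing unsuccessful programs and replacing them with newly generated ones.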
Related papers
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
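A hedged sketch of the core idea above, assuming the simplest possible interface: produce a Chain-of-Thought answer and a Program-of-Thought program for the same question, execute the program, and accept only when the two agree. `cot_solve` and `pot_solve` are hypothetical LLM wrappers, and the convention that the program defines `answer` is an assumption, not the paper's interface.

```python
from typing import Callable

def collaborative_verify(question: str,
                         cot_solve: Callable[[str], str],
                         pot_solve: Callable[[str], str]) -> bool:
    cot_answer = cot_solve(question)    # final answer from natural-language reasoning
    pot_program = pot_solve(question)   # executable reasoning for the same question
    scope: dict = {}
    exec(pot_program, scope)            # assumed convention: program defines `answer`
    pot_answer = str(scope.get("answer"))
    # Accept only when both reasoning modes agree on the result.
    return cot_answer.strip() == pot_answer.strip()
```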
- Code Repair with LLMs gives an Exploration-Exploitation Tradeoff [16.80314690163063]
Iteratively improving and repairing source code with large language models (LLMs) has emerged as a popular way of generating programs that would be too complex to construct in one shot.
We show here that refinement exposes an explore-exploit tradeoff: exploit by refining the program that passes the most test cases, or explore by refining a lesser considered program.
arXiv Detail & Related papers (2024-05-26T04:00:30Z)
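The tradeoff in the entry above can be sketched with a simple epsilon-greedy rule over a pool of candidates. The paper frames the tradeoff rather than prescribing this particular policy, so the rule and the 0.2 default are assumptions.

```python
import random
from typing import List, Tuple

def pick_program_to_refine(pool: List[Tuple[str, int]],
                           epsilon: float = 0.2) -> str:
    """pool holds (program, number_of_tests_passed) pairs."""
    if random.random() < epsilon:
        # Explore: refine a lesser-considered program.
        return random.choice(pool)[0]
    # Exploit: refine the program that currently passes the most tests.
    return max(pool, key=lambda entry: entry[1])[0]
```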
- HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis [25.260063704712458]
Large language models (LLMs) often fail to produce fully correct programs in unfamiliar DSLs.
Motivated by these limitations, we introduce a hybrid approach, where LLM completions for a given task are used to learn a task-specific, context-free surrogate model.
We evaluate this hybrid approach on three domains, and show that it outperforms both unguided search and direct sampling from LLMs, as well as existing program synthesizers.
arXiv Detail & Related papers (2024-05-24T18:45:51Z)
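As a rough illustration of the surrogate-model idea: fit a task-specific distribution from LLM completions and use it to score candidates during enumerative search. HYSYNTH learns a context-free, grammar-level model; the token-level unigram below is a deliberately simplified stand-in.

```python
from collections import Counter
from math import log
from typing import List

def fit_surrogate(llm_samples: List[str]) -> Counter:
    """Estimate token frequencies from LLM completions for one task."""
    counts: Counter = Counter()
    for program in llm_samples:
        counts.update(program.split())
    return counts

def score_candidate(candidate: str, counts: Counter, alpha: float = 1.0) -> float:
    """Smoothed log-probability of a search candidate under the surrogate."""
    total = sum(counts.values()) + alpha * (len(counts) + 1)
    return sum(
        log((counts[token] + alpha) / total) for token in candidate.split()
    )
```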
- Automated Program Repair: Emerging trends pose and expose problems for benchmarks [7.437224586066947]
Large language models (LLMs) are used to generate software patches.
Evaluations and comparisons must take care to ensure that results are valid and likely to generalize.
This is especially true for LLMs, whose large and often poorly-disclosed training datasets may include problems on which they are evaluated.
arXiv Detail & Related papers (2024-05-08T23:09:43Z)
- Benchmarking Educational Program Repair [4.981275578987307]
Large language models (LLMs) can be used to generate learning resources, improve error messages, and provide feedback on code.
There is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches.
In this article, we propose a novel educational program repair benchmark.
arXiv Detail & Related papers (2024-05-08T18:23:59Z)
- NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
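A small sketch of the kind of run-time information involved: collect an execution trace (line numbers plus local variable states) that could be serialized into a prompt for the model to reason over. The trace format is an assumption; NExT's actual rationales and training setup are richer.

```python
import sys
from typing import Any, Dict, List

def trace_execution(source: str) -> List[Dict[str, Any]]:
    """Run a code snippet and record variable states at each executed line."""
    events: List[Dict[str, Any]] = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<string>":
            events.append({
                "line": frame.f_lineno,
                "locals": {k: v for k, v in frame.f_locals.items()
                           if not k.startswith("__")},
            })
        return tracer

    code = compile(source, "<string>", "exec")
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(None)
    return events

# Example: per-line variable states for a tiny snippet.
print(trace_execution("x = 1\ny = x + 1\nz = y * 2"))
```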
- ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems.
We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness.
Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT.
arXiv Detail & Related papers (2023-05-24T00:10:15Z)
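The verification step can be sketched as differential testing against a slower but more obviously correct oracle, with both functions standing in for model-generated code. The input distribution and trial count are assumptions.

```python
import random
from typing import Callable

def verify_with_oracle(candidate: Callable[[int], int],
                       oracle: Callable[[int], int],
                       trials: int = 100) -> bool:
    """Accept the candidate only if it matches the oracle on random inputs."""
    for _ in range(trials):
        x = random.randint(0, 1000)
        if candidate(x) != oracle(x):
            return False  # counterexample found; regenerate the candidate
    return True

# Toy stand-ins: a closed-form candidate vs. a brute-force oracle.
fast = lambda n: n * (n + 1) // 2
slow = lambda n: sum(range(n + 1))
print(verify_with_oracle(fast, slow))  # True
```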
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
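A minimal sketch of the kind of reward signal such RL training derives from unit-test feedback, graded by how far execution gets before failing. The exact values here are illustrative assumptions, not the paper's published scheme.

```python
def reward(outcome: str) -> float:
    """Map an execution outcome to a scalar reward for RL fine-tuning."""
    return {
        "compile_error": -1.0,   # program does not parse or compile
        "runtime_error": -0.6,   # crashes on a test input
        "test_failed":   -0.3,   # runs, but produces wrong output
        "all_passed":     1.0,   # passes every unit test
    }[outcome]
```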
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
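Reranking reduces to scoring each sample with the learned predictor and submitting only the top one, which is how pass@1 can improve without running any code. `ranker_score` below is a hypothetical stand-in for the trained fault-aware model.

```python
from typing import Callable, List

def rerank(samples: List[str],
           ranker_score: Callable[[str], float]) -> str:
    """Submit the sample with the highest predicted chance of passing the tests."""
    return max(samples, key=ranker_score)
```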
- Learning from Self-Sampled Correct and Partially-Correct Programs [96.66452896657991]
We propose to let the model perform sampling during training and learn from both self-sampled fully-correct programs and partially-correct programs.
We show that our use of self-sampled correct and partially-correct programs can benefit learning and help guide the sampling process.
Our proposed method improves the pass@k performance by 3.1% to 12.3% compared to learning from a single reference program with MLE.
arXiv Detail & Related papers (2022-05-28T03:31:07Z)
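A rough sketch of the data-collection side, assuming pass rate on unit tests as the filter: keep fully-correct self-samples, plus partially-correct ones above a threshold, as additional training references. The 0.5 threshold and the interface are assumptions.

```python
from typing import Callable, List, Tuple

def collect_self_samples(samples: List[str],
                         pass_rate: Callable[[str], float],
                         threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    fully, partially = [], []
    for program in samples:
        rate = pass_rate(program)      # fraction of unit tests passed
        if rate == 1.0:
            fully.append(program)      # self-sampled fully-correct program
        elif rate >= threshold:
            partially.append(program)  # partially-correct, still useful for learning
    return fully, partially
```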
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.