Fully Autonomous Programming with Large Language Models
- URL: http://arxiv.org/abs/2304.10423v1
- Date: Thu, 20 Apr 2023 16:12:05 GMT
- Title: Fully Autonomous Programming with Large Language Models
- Authors: Vadim Liventsev and Anastasiia Grishina and Aki Härmä and Leon Moonen
- Abstract summary: Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome".
We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation.
The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
- Score: 0.9558392439655015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current approaches to program synthesis with Large Language Models (LLMs)
exhibit a "near miss syndrome": they tend to generate programs that
semantically resemble the correct answer (as measured by text similarity
metrics or human evaluation), but achieve a low or even zero accuracy as
measured by unit tests due to small imperfections, such as the wrong input or
output format. This calls for an approach known as Synthesize, Execute, Debug
(SED), whereby a draft of the solution is generated first, followed by a
program repair phase addressing the failed tests. To effectively apply this
approach to instruction-driven LLMs, one needs to determine which prompts
perform best as instructions for LLMs, as well as strike a balance between
repairing unsuccessful programs and replacing them with newly generated ones.
We explore these trade-offs empirically, comparing replace-focused,
repair-focused, and hybrid debug strategies, as well as different
template-based and model-based prompt-generation techniques. We use OpenAI
Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem
descriptions and tests for evaluation. The resulting framework outperforms both
conventional usage of Codex without the repair phase and traditional genetic
programming approaches.
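To make the SED loop concrete, here is a minimal Python sketch of the Synthesize, Execute, Debug cycle with a hybrid repair/replace budget. The `generate` and `repair` callables stand in for prompts to an instruction-driven LLM such as Codex; they, the budget values, and the test format are illustrative assumptions, not the authors' exact framework.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

Test = Tuple[str, str]  # (stdin, expected stdout)

def run_tests(program: str, tests: List[Test]) -> List[Test]:
    """Execute the candidate program against each test; return the failing ones."""
    failed = []
    for stdin, expected in tests:
        try:
            result = subprocess.run(
                ["python3", "-c", program],
                input=stdin, capture_output=True, text=True, timeout=5,
            )
            ok = result.returncode == 0 and result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            ok = False
        if not ok:
            failed.append((stdin, expected))
    return failed

def sed(description: str, tests: List[Test],
        generate: Callable[[str], str],
        repair: Callable[[str, List[Test]], str],
        max_iterations: int = 10, repair_budget: int = 3) -> Optional[str]:
    program = generate(description)            # Synthesize a first draft
    repairs_left = repair_budget
    for _ in range(max_iterations):
        failing = run_tests(program, tests)    # Execute against unit tests
        if not failing:
            return program                     # all tests pass
        if repairs_left > 0:                   # Debug: repair the draft...
            program = repair(program, failing)
            repairs_left -= 1
        else:                                  # ...or replace it outright
            program = generate(description)
            repairs_left = repair_budget
    return None
```

Exhausting a small repair budget before regenerating from scratch mirrors the balance the abstract describes between repairing unsuccessful programs and replacing them with newly generated ones.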
Related papers
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
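A hedged sketch of the core idea above, assuming the simplest possible interface: produce a Chain-of-Thought answer and a Program-of-Thought program for the same question, execute the program, and accept only when the two agree. `cot_solve` and `pot_solve` are hypothetical LLM wrappers, and the convention that the program defines `answer` is an assumption, not the paper's interface.

```python
from typing import Callable

def collaborative_verify(question: str,
                         cot_solve: Callable[[str], str],
                         pot_solve: Callable[[str], str]) -> bool:
    cot_answer = cot_solve(question)    # final answer from natural-language reasoning
    pot_program = pot_solve(question)   # executable reasoning for the same question
    scope: dict = {}
    exec(pot_program, scope)            # assumed convention: program defines `answer`
    pot_answer = str(scope.get("answer"))
    # Accept only when both reasoning modes agree on the result.
    return cot_answer.strip() == pot_answer.strip()
```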
- Code Repair with LLMs gives an Exploration-Exploitation Tradeoff [16.80314690163063]
Iteratively improving and repairing source code with large language models (LLMs) has emerged as a popular way of generating programs that would be too complex to construct in one shot.
We show here that refinement exposes an explore-exploit tradeoff: exploit by refining the program that passes the most test cases, or explore by refining a lesser considered program.
arXiv Detail & Related papers (2024-05-26T04:00:30Z)
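The tradeoff in the entry above can be sketched with a simple epsilon-greedy rule over a pool of candidates. The paper frames the tradeoff rather than prescribing this particular policy, so the rule and the 0.2 default are assumptions.

```python
import random
from typing import List, Tuple

def pick_program_to_refine(pool: List[Tuple[str, int]],
                           epsilon: float = 0.2) -> str:
    """pool holds (program, number_of_tests_passed) pairs."""
    if random.random() < epsilon:
        # Explore: refine a lesser-considered program.
        return random.choice(pool)[0]
    # Exploit: refine the program that currently passes the most tests.
    return max(pool, key=lambda entry: entry[1])[0]
```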
- HYSYNTH: Context-Free LLM Approximation for Guiding Program Synthesis [25.260063704712458]
Large language models (LLMs) often fail to produce fully correct programs in unfamiliar DSLs.
Motivated by these limitations, we introduce a hybrid approach, where LLM completions for a given task are used to learn a task-specific, context-free surrogate model.
We evaluate this hybrid approach on three domains, and show that it outperforms both unguided search and direct sampling from LLMs, as well as existing program synthesizers.
arXiv Detail & Related papers (2024-05-24T18:45:51Z)
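As a rough illustration of the surrogate-model idea: fit a task-specific distribution from LLM completions and use it to score candidates during enumerative search. HYSYNTH learns a context-free, grammar-level model; the token-level unigram below is a deliberately simplified stand-in.

```python
from collections import Counter
from math import log
from typing import List

def fit_surrogate(llm_samples: List[str]) -> Counter:
    """Estimate token frequencies from LLM completions for one task."""
    counts: Counter = Counter()
    for program in llm_samples:
        counts.update(program.split())
    return counts

def score_candidate(candidate: str, counts: Counter, alpha: float = 1.0) -> float:
    """Smoothed log-probability of a search candidate under the surrogate."""
    total = sum(counts.values()) + alpha * (len(counts) + 1)
    return sum(
        log((counts[token] + alpha) / total) for token in candidate.split()
    )
```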
- Automated Program Repair: Emerging trends pose and expose problems for benchmarks [7.437224586066947]
Large language models (LLMs) are used to generate software patches.
Evaluations and comparisons must take care to ensure that results are valid and likely to generalize.
This is especially true for LLMs, whose large and often poorly-disclosed training datasets may include problems on which they are evaluated.
arXiv Detail & Related papers (2024-05-08T23:09:43Z)
- Benchmarking Educational Program Repair [4.981275578987307]
Large language models (LLMs) can be used to generate learning resources, improve error messages, and provide feedback on code.
There is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches.
In this article, we propose a novel educational program repair benchmark.
arXiv Detail & Related papers (2024-05-08T18:23:59Z)
- NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
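A small sketch of the kind of run-time information involved: collect an execution trace (line numbers plus local variable states) that could be serialized into a prompt for the model to reason over. The trace format is an assumption; NExT's actual rationales and training setup are richer.

```python
import sys
from typing import Any, Dict, List

def trace_execution(source: str) -> List[Dict[str, Any]]:
    """Run a code snippet and record variable states at each executed line."""
    events: List[Dict[str, Any]] = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<string>":
            events.append({
                "line": frame.f_lineno,
                "locals": {k: v for k, v in frame.f_locals.items()
                           if not k.startswith("__")},
            })
        return tracer

    code = compile(source, "<string>", "exec")
    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(None)
    return events

# Example: per-line variable states for a tiny snippet.
print(trace_execution("x = 1\ny = x + 1\nz = y * 2"))
```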
- ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems.
We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness.
Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT.
arXiv Detail & Related papers (2023-05-24T00:10:15Z)
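The verification step can be sketched as differential testing against a slower but more obviously correct oracle, with both functions standing in for model-generated code. The input distribution and trial count are assumptions.

```python
import random
from typing import Callable

def verify_with_oracle(candidate: Callable[[int], int],
                       oracle: Callable[[int], int],
                       trials: int = 100) -> bool:
    """Accept the candidate only if it matches the oracle on random inputs."""
    for _ in range(trials):
        x = random.randint(0, 1000)
        if candidate(x) != oracle(x):
            return False  # counterexample found; regenerate the candidate
    return True

# Toy stand-ins: a closed-form candidate vs. a brute-force oracle.
fast = lambda n: n * (n + 1) // 2
slow = lambda n: sum(range(n + 1))
print(verify_with_oracle(fast, slow))  # True
```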
- CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
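A minimal sketch of the kind of reward signal such RL training derives from unit-test feedback, graded by how far execution gets before failing. The exact values here are illustrative assumptions, not the paper's published scheme.

```python
def reward(outcome: str) -> float:
    """Map an execution outcome to a scalar reward for RL fine-tuning."""
    return {
        "compile_error": -1.0,   # program does not parse or compile
        "runtime_error": -0.6,   # crashes on a test input
        "test_failed":   -0.3,   # runs, but produces wrong output
        "all_passed":     1.0,   # passes every unit test
    }[outcome]
```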
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
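Reranking reduces to scoring each sample with the learned predictor and submitting only the top one, which is how pass@1 can improve without running any code. `ranker_score` below is a hypothetical stand-in for the trained fault-aware model.

```python
from typing import Callable, List

def rerank(samples: List[str],
           ranker_score: Callable[[str], float]) -> str:
    """Submit the sample with the highest predicted chance of passing the tests."""
    return max(samples, key=ranker_score)
```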
- Learning from Self-Sampled Correct and Partially-Correct Programs [96.66452896657991]
We propose to let the model perform sampling during training and learn from both self-sampled fully-correct programs and partially-correct programs.
We show that our use of self-sampled correct and partially-correct programs can benefit learning and help guide the sampling process.
Our proposed method improves the pass@k performance by 3.1% to 12.3% compared to learning from a single reference program with MLE.
arXiv Detail & Related papers (2022-05-28T03:31:07Z)
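A rough sketch of the data-collection side, assuming pass rate on unit tests as the filter: keep fully-correct self-samples, plus partially-correct ones above a threshold, as additional training references. The 0.5 threshold and the interface are assumptions.

```python
from typing import Callable, List, Tuple

def collect_self_samples(samples: List[str],
                         pass_rate: Callable[[str], float],
                         threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    fully, partially = [], []
    for program in samples:
        rate = pass_rate(program)      # fraction of unit tests passed
        if rate == 1.0:
            fully.append(program)      # self-sampled fully-correct program
        elif rate >= threshold:
            partially.append(program)  # partially-correct, still useful for learning
    return fully, partially
```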
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.