Is Self-Repair a Silver Bullet for Code Generation?
- URL: http://arxiv.org/abs/2306.09896v5
- Date: Fri, 2 Feb 2024 18:31:34 GMT
- Title: Is Self-Repair a Silver Bullet for Code Generation?
- Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao,
Armando Solar-Lezama
- Abstract summary: Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks.
Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance.
We analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS.
- Score: 68.02601393906083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown remarkable aptitude in code generation, but
still struggle to perform complex tasks. Self-repair -- in which the model
debugs and repairs its own code -- has recently become a popular way to boost
performance in these settings. However, despite its increasing popularity,
existing studies of self-repair have been limited in scope; in many settings,
its efficacy thus remains poorly understood. In this paper, we analyze Code
Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken
from HumanEval and APPS. We find that when the cost of carrying out repair is
taken into account, performance gains are often modest, vary a lot between
subsets of the data, and are sometimes not present at all. We hypothesize that
this is because self-repair is bottlenecked by the model's ability to provide
feedback on its own code; using a stronger model to artificially boost the
quality of the feedback, we observe substantially larger performance gains.
Similarly, a small-scale study in which we provide GPT-4 with feedback from
human participants suggests that even for the strongest models, self-repair
still lags far behind what can be achieved with human-level debugging.
Related papers
- FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks [28.849481030601666]
We introduce FeedbackEval, a benchmark for evaluating large language models' feedback comprehension and performance.
We conduct a comprehensive empirical study on five state-of-the-art LLMs, including GPT-4o, Claude-3.5, Gemini-1.5, GLM-4, and Qwen2.5.
Our results show that structured feedback, particularly in the form of test feedback, leads to the highest repair success rates, while unstructured feedback proves significantly less effective.
arXiv Detail & Related papers (2025-04-09T14:43:08Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference.
We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap.
We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself.
We present Reflection-Reinforced Self-Training (Re-ReST), which uses a textitreflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z) - A Theoretical Understanding of Self-Correction through In-context Alignment [51.622068973630796]
Large language models (LLMs) are capable of improving their abilities purely by self-correction.
We show that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.
Inspired by these findings, we also illustrate applications of self-correction, such as defending against LLM jailbreaks.
arXiv Detail & Related papers (2024-05-28T22:33:02Z) - Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs)
This work explores whether small (= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z) - CYCLE: Learning to Self-Refine the Code Generation [19.71833229434497]
We propose CYCLE framework, learning to self-refine the faulty generation according to the available feedback.
We implement four variants of CYCLE with varied numbers of parameters across 350M, 1B, 2B, and 3B benchmarks.
The results reveal that CYCLE successfully maintains, sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs.
arXiv Detail & Related papers (2024-03-27T16:45:02Z) - Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak
Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z) - RL4F: Generating Natural Language Feedback with Reinforcement Learning
for Repairing Model Outputs [27.777809444120827]
Previous work proposed providing language models with natural language feedback to guide them in repairing their outputs.
We introduce RL4F, a multi-agent collaborative framework where critique generator is trained to maximize end-task performance of GPT-3.
We show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
arXiv Detail & Related papers (2023-05-15T17:57:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.