Is Self-Repair a Silver Bullet for Code Generation?
- URL: http://arxiv.org/abs/2306.09896v5
- Date: Fri, 2 Feb 2024 18:31:34 GMT
- Title: Is Self-Repair a Silver Bullet for Code Generation?
- Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao,
Armando Solar-Lezama
- Abstract summary: Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks.
Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance.
We analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS.
- Score: 68.02601393906083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown remarkable aptitude in code generation, but
still struggle to perform complex tasks. Self-repair -- in which the model
debugs and repairs its own code -- has recently become a popular way to boost
performance in these settings. However, despite its increasing popularity,
existing studies of self-repair have been limited in scope; in many settings,
its efficacy thus remains poorly understood. In this paper, we analyze Code
Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken
from HumanEval and APPS. We find that when the cost of carrying out repair is
taken into account, performance gains are often modest, vary a lot between
subsets of the data, and are sometimes not present at all. We hypothesize that
this is because self-repair is bottlenecked by the model's ability to provide
feedback on its own code; using a stronger model to artificially boost the
quality of the feedback, we observe substantially larger performance gains.
Similarly, a small-scale study in which we provide GPT-4 with feedback from
human participants suggests that even for the strongest models, self-repair
still lags far behind what can be achieved with human-level debugging.
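The abstract's point about repair cost can be made concrete with a minimal sketch of a self-repair loop that counts every model call it makes. This is an illustration only, not the authors' implementation: `self_repair`, `run_tests`, the toy `generate`, `give_feedback` and `repair` stubs, and the use of model calls as the cost unit are all assumptions introduced here. In a real setting the three stubs would be calls to a language model.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces; none of these names come from the paper.
Generate = Callable[[str], str]            # task -> candidate program
Feedback = Callable[[str, str, str], str]  # task, program, error -> explanation
Repair = Callable[[str, str, str], str]    # task, program, feedback -> new program
RunTests = Callable[[str], Optional[str]]  # program -> error message, None if passing


def self_repair(task: str, generate: Generate, give_feedback: Feedback,
                repair: Repair, run_tests: RunTests,
                n_initial: int, n_repairs: int) -> Tuple[Optional[str], int]:
    """Return (passing program or None, number of model calls spent)."""
    calls = 0
    for _ in range(n_initial):
        program = generate(task)
        calls += 1
        error = run_tests(program)
        if error is None:
            return program, calls
        # Each repair attempt starts from the same failing program.
        for _ in range(n_repairs):
            feedback = give_feedback(task, program, error)
            candidate = repair(task, program, feedback)
            calls += 2  # one feedback call plus one repair call
            if run_tests(candidate) is None:
                return candidate, calls
    return None, calls


# Toy stand-ins for the model (assumed for illustration).
def generate(task: str) -> str:
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy


def give_feedback(task: str, program: str, error: str) -> str:
    return "The function subtracts; it should add."


def repair(task: str, program: str, feedback: str) -> str:
    return program.replace("a - b", "a + b")


def run_tests(program: str) -> Optional[str]:
    namespace: dict = {}
    try:
        exec(program, namespace)  # run the candidate, then check one unit test
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


program, calls = self_repair("Write add(a, b).", generate, give_feedback,
                             repair, run_tests, n_initial=2, n_repairs=2)
```

Here the repaired program costs three model calls. Whether that beats simply drawing three independent samples is the kind of comparison the abstract refers to when it says repair cost must be taken into account.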
Related papers
- Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself.
We present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples.
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
- A Theoretical Understanding of Self-Correction through In-context Alignment [51.622068973630796]
Large language models (LLMs) are capable of improving their abilities purely by self-correction.
We show that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.
Inspired by these findings, we also illustrate applications of self-correction, such as defending against LLM jailbreaks.
arXiv Detail & Related papers (2024-05-28T22:33:02Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) have the ability to self-correct on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- CYCLE: Learning to Self-Refine the Code Generation [19.71833229434497]
We propose the CYCLE framework, which learns to self-refine faulty generations according to the available feedback.
We implement four variants of CYCLE with 350M, 1B, 2B, and 3B parameters.
The results reveal that CYCLE successfully maintains, sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs.
arXiv Detail & Related papers (2024-03-27T16:45:02Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
- RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs [27.777809444120827]
Previous work proposed providing language models with natural language feedback to guide them in repairing their outputs.
We introduce RL4F, a multi-agent collaborative framework where the critique generator is trained to maximize end-task performance of GPT-3.
We show relative improvements up to 10% in multiple text similarity metrics over other learned, retrieval-augmented or prompting-based critique generators.
arXiv Detail & Related papers (2023-05-15T17:57:16Z)
- Aligning Offline Metrics and Human Judgments of Value for Code Generation Models [25.726216146776054]
We show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task.
We propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value.
arXiv Detail & Related papers (2022-10-29T05:03:28Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.