Is Self-Repair a Silver Bullet for Code Generation?
- URL: http://arxiv.org/abs/2306.09896v5
- Date: Fri, 2 Feb 2024 18:31:34 GMT
- Title: Is Self-Repair a Silver Bullet for Code Generation?
- Authors: Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao,
Armando Solar-Lezama
- Abstract summary: Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks.
Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance.
We analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS.
- Score: 68.02601393906083
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown remarkable aptitude in code generation, but
still struggle to perform complex tasks. Self-repair -- in which the model
debugs and repairs its own code -- has recently become a popular way to boost
performance in these settings. However, despite its increasing popularity,
existing studies of self-repair have been limited in scope; in many settings,
its efficacy thus remains poorly understood. In this paper, we analyze Code
Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken
from HumanEval and APPS. We find that when the cost of carrying out repair is
taken into account, performance gains are often modest, vary a lot between
subsets of the data, and are sometimes not present at all. We hypothesize that
this is because self-repair is bottlenecked by the model's ability to provide
feedback on its own code; using a stronger model to artificially boost the
quality of the feedback, we observe substantially larger performance gains.
Similarly, a small-scale study in which we provide GPT-4 with feedback from
human participants suggests that even for the strongest models, self-repair
still lags far behind what can be achieved with human-level debugging.
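
To make the setup concrete, below is a minimal sketch of the kind of self-repair loop the paper studies: sample an initial program, run it against tests, have the model explain the failure, then regenerate. All names here (`llm_complete`, `run_tests`, the prompts, and the sampling budgets) are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Minimal sketch of an LLM self-repair loop (illustrative only; not the
# paper's code). `llm_complete` is an assumed prompt -> completion call
# backed by some model API; `run_tests` executes the candidate program
# against unit tests and returns (passed, error_log).

from typing import Callable, Optional

def self_repair(
    task: str,
    llm_complete: Callable[[str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    n_initial: int = 5,   # i.i.d. initial samples
    n_repairs: int = 1,   # repair attempts per failing sample
) -> Optional[str]:
    """Return the first program that passes the tests, or None."""
    for _ in range(n_initial):
        code = llm_complete(f"Write a Python solution for:\n{task}")
        passed, log = run_tests(code)
        if passed:
            return code
        for _ in range(n_repairs):
            # Self-repair step 1: the model critiques its own failure.
            feedback = llm_complete(
                f"Task:\n{task}\n\nCode:\n{code}\n\nTest output:\n{log}\n"
                "Explain concisely why the code is wrong."
            )
            # Self-repair step 2: regenerate conditioned on the feedback.
            code = llm_complete(
                f"Task:\n{task}\n\nBuggy code:\n{code}\n\n"
                f"Feedback:\n{feedback}\n\nWrite a corrected solution."
            )
            passed, log = run_tests(code)
            if passed:
                return code
    return None
```

Note the accounting the paper emphasizes: a fair comparison against a no-repair baseline holds the total sample budget (initial draws plus repair attempts) fixed, and under that accounting the gains from repair are often modest. The feedback step is the bottleneck the paper identifies; replacing the model that produces `feedback` with a stronger one (e.g., GPT-4 critiquing GPT-3.5's code) corresponds to the experiment in which substantially larger gains appear.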
Related papers
- Iterative Deepening Sampling for Large Language Models [27.807695570974644]
Training models to achieve effective self-correction remains a significant challenge.
We propose a novel iterative sampling algorithm framework designed to enhance self-correction and generate higher-quality samples.
arXiv Detail & Related papers (2025-02-08T04:39:51Z)
- Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities.
It self-improves by training on synthetic data generated by a contrastive-based self-critic.
It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models [10.449015816015566]
Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference.
We provide a mathematical formulation of self-improvement, which is largely governed by a quantity we formalize as the generation-verification gap (see the sketch after this list).
We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance.
arXiv Detail & Related papers (2024-12-03T18:47:26Z)
- Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
- A Theoretical Understanding of Self-Correction through In-context Alignment [51.622068973630796]
Large language models (LLMs) are capable of improving their abilities purely by self-correction.
We show that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way.
Inspired by these findings, we also illustrate applications of self-correction, such as defending against LLM jailbreaks.
arXiv Detail & Related papers (2024-05-28T22:33:02Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) can self-correct on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision [55.196139002977525]
Superhuman models will behave in complex ways too difficult for humans to reliably evaluate.
Can weak model supervision elicit the full capabilities of a much stronger model?
We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors.
arXiv Detail & Related papers (2023-12-14T23:07:33Z)
- RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs [27.777809444120827]
Previous work proposed providing language models with natural language feedback to guide them in repairing their outputs.
We introduce RL4F, a multi-agent collaborative framework in which a critique generator is trained to maximize the end-task performance of GPT-3.
We show relative improvements of up to 10% on multiple text-similarity metrics over other learned, retrieval-augmented, or prompting-based critique generators.
arXiv Detail & Related papers (2023-05-15T17:57:16Z)
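
As referenced in the Mind the Gap entry above, the generation-verification gap admits a compact formalization. The following LaTeX sketch is our paraphrase under assumed notation (p, f, and v are illustrative symbols; the paper's exact definition may differ):

```latex
% Sketch of a generation-verification gap (our notation, not
% necessarily the paper's). Let p be the model's generation
% distribution over candidate outputs y, f(y) the true quality of y,
% and v(y) >= 0 the model's own verification score for y.
% Verifier-guided self-improvement helps exactly when reweighting
% generations by v raises the expected quality:
\[
  \mathrm{GVGap}(p, v) \;=\;
  \frac{\mathbb{E}_{y \sim p}\left[ v(y)\, f(y) \right]}
       {\mathbb{E}_{y \sim p}\left[ v(y) \right]}
  \;-\;
  \mathbb{E}_{y \sim p}\left[ f(y) \right].
\]
% If the model cannot verify better than it generates (v uncorrelated
% with f), the gap is zero and self-improvement stalls.
```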