LEVER: Learning to Verify Language-to-Code Generation with Execution
- URL: http://arxiv.org/abs/2302.08468v3
- Date: Fri, 1 Sep 2023 17:37:42 GMT
- Title: LEVER: Learning to Verify Language-to-Code Generation with Execution
- Authors: Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin
- Abstract summary: We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all four datasets.
- Score: 64.36459105535
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advent of large language models trained on code (code LLMs) has led to
significant progress in language-to-code generation. State-of-the-art
approaches in this area combine LLM decoding with sample pruning and reranking
using test cases or heuristics based on the execution results. However, it is
challenging to obtain test cases for many real-world language-to-code
applications, and heuristics cannot adequately capture the semantic features of the
execution results, such as data type and value range, which often indicate the
correctness of the program. In this work, we propose LEVER, a simple approach
to improve language-to-code generation by learning to verify the generated
programs with their execution results. Specifically, we train verifiers to
determine whether a program sampled from the LLMs is correct or not based on
the natural language input, the program itself and its execution results. The
sampled programs are reranked by combining the verification score with the LLM
generation probability, and marginalizing over programs with the same execution
results. On four datasets across the domains of table QA, math QA and basic
Python programming, LEVER consistently improves over the base code LLMs (4.6% to
10.9% with code-davinci-002) and achieves new state-of-the-art results on all
of them.
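The reranking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Candidate` type, the `rerank` function, and the toy probabilities are all hypothetical, and the verifier is assumed to have already produced a correctness probability for each sample. The sketch multiplies each program's LLM generation probability by its verifier probability, sums those scores over programs that share an execution result, and returns a program from the highest-scoring group.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    """One sampled program (hypothetical container, not from the paper)."""
    program: str                # sampled program text
    execution_result: Hashable  # result of running the program; must be hashable here
    lm_logprob: float           # log-probability the code LLM assigned to the program
    verifier_prob: float        # assumed verifier output: P(program is correct)


def joint_score(c: Candidate) -> float:
    """Combine the generation probability with the verification probability."""
    return math.exp(c.lm_logprob) * c.verifier_prob


def rerank(candidates: List[Candidate]) -> Tuple[Candidate, Dict[Hashable, float]]:
    """Pick a program whose execution result has the largest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Marginalize: programs with the same execution result pool their scores.
    totals: Dict[Hashable, float] = defaultdict(float)
    for c in candidates:
        totals[c.execution_result] += joint_score(c)
    best_result = max(totals, key=totals.get)
    # Return the highest-scoring program within the winning group.
    group = [c for c in candidates if c.execution_result == best_result]
    return max(group, key=joint_score), dict(totals)


# Toy data (made up for illustration only).
samples = [
    Candidate("answer = 6 * 7", 42, math.log(0.5), 0.4),   # joint score 0.20
    Candidate("answer = 3 + 4", 7, math.log(0.2), 0.6),    # joint score 0.12
    Candidate("answer = 14 // 2", 7, math.log(0.2), 0.6),  # joint score 0.12
]
best, totals = rerank(samples)
print(best.execution_result)  # prints 7
```

In this toy example the first program has the highest individual score (0.20), but the two programs that both evaluate to 7 together reach about 0.24, so aggregating over execution results changes which answer is selected.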
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce synthetic programming elicitation and compilation (SPEAC).
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z)
- Towards Translating Real-World Code with LLMs: A Study of Translating to Rust [13.743967357458287]
Large language models (LLMs) show promise in code translation due to their ability to write code in most programming languages.
We conduct our study on code extracted from real-world open source projects.
FLOURINE is an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program.
arXiv Detail & Related papers (2024-05-19T10:54:03Z)
- Executing Natural Language-Described Algorithms with Large Language Models: An Investigation [48.461999568129166]
We examine the capacity of present-day large language models to comprehend and execute algorithms outlined in natural language.
We selected 30 algorithms, generated 300 random-sampled instances, and evaluated whether popular LLMs can understand and execute these algorithms.
Our findings reveal that LLMs, notably GPT-4, can effectively execute programs described in natural language, as long as no heavy numeric computation is involved.
arXiv Detail & Related papers (2024-02-23T05:31:36Z)
- Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs [5.549095839198671]
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages.
We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions.
We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs.
We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
arXiv Detail & Related papers (2024-01-11T14:27:43Z)
- Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- Natural Language to Code Translation with Execution [82.52142893010563]
We introduce execution result-based minimum Bayes risk decoding (MBR-EXEC) for program selection.
We show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks.
arXiv Detail & Related papers (2022-04-25T06:06:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.