Program Synthesis with Large Language Models
- URL: http://arxiv.org/abs/2108.07732v1
- Date: Mon, 16 Aug 2021 03:57:30 GMT
- Title: Program Synthesis with Large Language Models
- Authors: Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk
Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le,
Charles Sutton
- Abstract summary: We evaluate large language models for program synthesis in Python.
We find that synthesis performance scales log-linearly with model size.
We find that even our best models are generally unable to predict the output of a program given a specific input.
- Score: 40.41120807053989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores the limits of the current generation of large language
models for program synthesis in general purpose programming languages. We
evaluate a collection of such models (with between 244M and 137B parameters) on
two new benchmarks, MBPP and MathQA-Python, in both the few-shot and
fine-tuning regimes. Our benchmarks are designed to measure the ability of
these models to synthesize short Python programs from natural language
descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974
programming tasks, designed to be solvable by entry-level programmers. The
MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914
problems that evaluate the ability of the models to synthesize code from more
complex text. On both datasets, we find that synthesis performance scales
log-linearly with model size. Our largest models, even without finetuning on a
code dataset, can synthesize solutions to 59.6 percent of the problems from
MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a
held-out portion of the dataset improves performance by about 10 percentage
points across most model sizes. On the MathQA-Python dataset, the largest
fine-tuned model achieves 83.8 percent accuracy. Going further, we study the
model's ability to engage in dialog about code, incorporating human feedback to
improve its solutions. We find that natural language feedback from a human
halves the error rate compared to the model's initial prediction. Additionally,
we conduct an error analysis to shed light on where these models fall short and
what types of programs are most difficult to generate. Finally, we explore the
semantic grounding of these models by fine-tuning them to predict the results
of program execution. We find that even our best models are generally unable to
predict the output of a program given a specific input.
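The MBPP setup the abstract describes (natural-language task descriptions paired with assert-based test cases, attempted via few-shot prompting and checked for functional correctness) can be summarized in a short harness. The following is a minimal sketch, not the authors' code: the prompt wording, the `sample_completions(prompt, n)` sampling interface, and the sample budget are illustrative assumptions.

```python
from typing import Callable, Dict, List

def build_prompt(few_shot: List[Dict], task: Dict) -> str:
    """Concatenate solved examples, then the new task description and its tests."""
    parts = []
    for ex in few_shot:
        parts.append(ex["text"] + "\nYour code should satisfy these tests:\n"
                     + "\n".join(ex["tests"]) + "\n" + ex["code"] + "\n")
    parts.append(task["text"] + "\nYour code should satisfy these tests:\n"
                 + "\n".join(task["tests"]) + "\n")
    return "\n".join(parts)

def passes_tests(candidate: str, tests: List[str]) -> bool:
    """Run a candidate program and its asserts; any exception counts as failure."""
    env: Dict = {}
    try:
        exec(candidate, env)   # define the synthesized function(s)
        for t in tests:
            exec(t, env)       # each test is an `assert ...` statement
        return True
    except Exception:
        return False

def solved(task: Dict, few_shot: List[Dict],
           sample_completions: Callable[[str, int], List[str]], n: int = 80) -> bool:
    """A task counts as solved if any of n sampled programs passes all its tests."""
    prompt = build_prompt(few_shot, task)
    return any(passes_tests(c, task["tests"]) for c in sample_completions(prompt, n))
```

Under this reading, the paper's headline numbers (e.g. 59.6 percent of MBPP problems solved few-shot) correspond to the fraction of tasks for which at least one sampled program passes all of its test cases.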
Related papers
- Learning Program Behavioral Models from Synthesized Input-Output Pairs [70.9524884086882]
We introduce Modelizer, a framework that learns a model from its input/output behavior using neural machine translation.
Modelizer uses grammars to synthesize inputs and to parse the resulting outputs, allowing it to learn sequence-to-sequence associations between token streams.
Other than input and output grammars, Modelizer only requires the ability to execute the program.
arXiv Detail & Related papers (2024-07-11T15:25:02Z) - Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation [0.0]
Large Language Models (LLMs) have become a popular choice for many Natural Language Processing (NLP) tasks.
LLMs' substantial computational and memory requirements often make them inaccessible to users with limited resources.
This paper focuses on very low-cost models which offer a more accessible alternative to resource-intensive LLMs.
arXiv Detail & Related papers (2024-04-17T08:16:48Z) - HumanEval on Latest GPT Models -- 2024 [2.3279007422505322]
This dataset was initially developed to be used with a language model called CODEGEN, trained on natural and programming language data.
The utility of these trained models is showcased by demonstrating their competitive performance in zero-shot Python code generation on HumanEval tasks.
arXiv Detail & Related papers (2024-02-20T04:17:21Z) - Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists in splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z) - Qwen Technical Report [132.54304067403922]
We introduce Qwen, the first installment of our large language model series.
The series comprises Qwen, the base pretrained language models, and Qwen-Chat, chat models finetuned with human alignment techniques.
We have also developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat.
arXiv Detail & Related papers (2023-09-28T17:07:49Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Scaling Language Models: Methods, Analysis & Insights from Training
Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)