Program Synthesis with Large Language Models
- URL: http://arxiv.org/abs/2108.07732v1
- Date: Mon, 16 Aug 2021 03:57:30 GMT
- Title: Program Synthesis with Large Language Models
- Authors: Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk
Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le,
Charles Sutton
- Abstract summary: We evaluate large language models for program synthesis in Python.
We find that synthesis performance scales log-linearly with model size.
We find that even our best models are generally unable to predict the output of a program given a specific input.
- Score: 40.41120807053989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores the limits of the current generation of large language
models for program synthesis in general purpose programming languages. We
evaluate a collection of such models (with between 244M and 137B parameters) on
two new benchmarks, MBPP and MathQA-Python, in both the few-shot and
fine-tuning regimes. Our benchmarks are designed to measure the ability of
these models to synthesize short Python programs from natural language
descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974
programming tasks, designed to be solvable by entry-level programmers. The
MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914
problems that evaluate the ability of the models to synthesize code from more
complex text. On both datasets, we find that synthesis performance scales
log-linearly with model size. Our largest models, even without finetuning on a
code dataset, can synthesize solutions to 59.6 percent of the problems from
MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a
held-out portion of the dataset improves performance by about 10 percentage
points across most model sizes. On the MathQA-Python dataset, the largest
fine-tuned model achieves 83.8 percent accuracy. Going further, we study the
model's ability to engage in dialog about code, incorporating human feedback to
improve its solutions. We find that natural language feedback from a human
halves the error rate compared to the model's initial prediction. Additionally,
we conduct an error analysis to shed light on where these models fall short and
what types of programs are most difficult to generate. Finally, we explore the
semantic grounding of these models by fine-tuning them to predict the results
of program execution. We find that even our best models are generally unable to
predict the output of a program given a specific input.
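The MBPP setup the abstract describes (natural-language task descriptions paired with assert-based test cases, attempted via few-shot prompting and checked for functional correctness) can be summarized in a short harness. The following is a minimal sketch, not the authors' code: the prompt wording, the `sample_completions(prompt, n)` sampling interface, and the sample budget are illustrative assumptions.

```python
from typing import Callable, Dict, List

def build_prompt(few_shot: List[Dict], task: Dict) -> str:
    """Concatenate solved examples, then the new task description and its tests."""
    parts = []
    for ex in few_shot:
        parts.append(ex["text"] + "\nYour code should satisfy these tests:\n"
                     + "\n".join(ex["tests"]) + "\n" + ex["code"] + "\n")
    parts.append(task["text"] + "\nYour code should satisfy these tests:\n"
                 + "\n".join(task["tests"]) + "\n")
    return "\n".join(parts)

def passes_tests(candidate: str, tests: List[str]) -> bool:
    """Run a candidate program and its asserts; any exception counts as failure."""
    env: Dict = {}
    try:
        exec(candidate, env)   # define the synthesized function(s)
        for t in tests:
            exec(t, env)       # each test is an `assert ...` statement
        return True
    except Exception:
        return False

def solved(task: Dict, few_shot: List[Dict],
           sample_completions: Callable[[str, int], List[str]], n: int = 80) -> bool:
    """A task counts as solved if any of n sampled programs passes all its tests."""
    prompt = build_prompt(few_shot, task)
    return any(passes_tests(c, task["tests"]) for c in sample_completions(prompt, n))
```

Under this reading, the paper's headline numbers (e.g. 59.6 percent of MBPP problems solved few-shot) correspond to the fraction of tasks for which at least one sampled program passes all of its test cases.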
Related papers
- Learning Program Behavioral Models from Synthesized Input-Output Pairs [70.9524884086882]
We introduce Modelizer, a framework that learns a model from its input/output behavior using neural machine translation.
Modelizer uses grammars to synthesize inputs and to parse the resulting outputs, allowing it to learn sequence-to-sequence associations between token streams.
Other than input and output grammars, Modelizer only requires the ability to execute the program.
arXiv Detail & Related papers (2024-07-11T15:25:02Z) - Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation [0.0]
Large Language Models (LLMs) have become a popular choice for many Natural Language Processing (NLP) tasks.
LLMs' substantial computational and memory requirements often make them inaccessible to users with limited resources.
This paper focuses on very low-cost models which offer a more accessible alternative to resource-intensive LLMs.
arXiv Detail & Related papers (2024-04-17T08:16:48Z) - HumanEval on Latest GPT Models -- 2024 [2.3279007422505322]
This dataset was initially developed to be used with a language model called CODEGEN, trained on natural and programming language data.
The utility of these trained models is showcased by demonstrating their competitive performance in zero-shot Python code generation on HumanEval tasks.
arXiv Detail & Related papers (2024-02-20T04:17:21Z) - Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists in splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z) - Qwen Technical Report [132.54304067403922]
We introduce Qwen, the first installment of our large language model series.
The series comprises Qwen, the base pretrained language models, and Qwen-Chat, chat models finetuned with human alignment techniques.
We have also developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat.
arXiv Detail & Related papers (2023-09-28T17:07:49Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - Scaling Language Models: Methods, Analysis & Insights from Training
Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)