Synchromesh: Reliable code generation from pre-trained language models
- URL: http://arxiv.org/abs/2201.11227v1
- Date: Wed, 26 Jan 2022 22:57:44 GMT
- Title: Synchromesh: Reliable code generation from pre-trained language models
- Authors: Gabriel Poesia, Oleksandr Polozov, Vu Le, Ashish Tiwari, Gustavo
Soares, Christopher Meek, Sumit Gulwani
- Abstract summary: We propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation.
First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection.
Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD), a general framework for constraining the output to a set of valid programs in the target language.
- Score: 38.15391794443022
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large pre-trained language models have been used to generate code, providing a
flexible interface for synthesizing programs from natural language
specifications. However, they often violate syntactic and semantic rules of
their output language, limiting their practical usability. In this paper, we
propose Synchromesh: a framework for substantially improving the reliability of
pre-trained models for code generation. Synchromesh comprises two components.
First, it retrieves few-shot examples from a training bank using Target
Similarity Tuning (TST), a novel method for semantic example selection. TST
learns to recognize utterances that describe similar target programs despite
differences in surface natural language features. Then, Synchromesh feeds the
examples to a pre-trained language model and samples programs using Constrained
Semantic Decoding (CSD): a general framework for constraining the output to a
set of valid programs in the target language. CSD leverages constraints on
partial outputs to sample complete correct programs, and needs neither
re-training nor fine-tuning of the language model. We evaluate our methods by
synthesizing code from natural language descriptions using GPT-3 and Codex in
three real-world languages: SQL queries, Vega-Lite visualizations and SMCalFlow
programs. These domains showcase rich constraints that CSD is able to enforce,
including syntax, scope, typing rules, and contextual logic. We observe
substantial complementary gains from CSD and TST in prediction accuracy and in
effectively preventing run-time errors.
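The decoding loop described in the abstract can be pictured as rejecting invalid next tokens during generation. Below is a minimal, hedged sketch of that idea, not the authors' implementation: `lm_next_token_logprobs` and `is_valid_prefix` are hypothetical stand-ins for the language-model API and for a completion-engine-style validity check (syntax, scope, typing, contextual rules), and the few-shot examples selected by TST would simply be part of `prompt`. The paper's actual CSD machinery (token/grammar alignment, sampling rather than greedy search) is not reproduced here.

```python
# Illustrative sketch of constrained decoding in the spirit of CSD:
# only emit tokens that keep the partial program completable into a valid one.
# lm_next_token_logprobs and is_valid_prefix are assumed, hypothetical callables.

def constrained_decode(prompt, lm_next_token_logprobs, is_valid_prefix,
                       eos="<eos>", max_tokens=256):
    """Greedy decoding that never lets the partial output become invalid."""
    prefix = ""
    for _ in range(max_tokens):
        # Token -> log-probability for the next step, given prompt + prefix.
        logprobs = lm_next_token_logprobs(prompt, prefix)
        # Reject any token whose addition breaks validity of the partial program.
        valid = {tok: lp for tok, lp in logprobs.items()
                 if tok == eos or is_valid_prefix(prefix + tok)}
        if not valid:
            break  # no valid continuation; a real system would backtrack or resample
        best = max(valid, key=valid.get)
        if best == eos:
            break
        prefix += best
    return prefix
```

For SQL, for example, an `is_valid_prefix` check could reject any prefix that references a column outside the tables in scope, which is the kind of contextual constraint the abstract mentions.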
Related papers
- Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce synthetic programming elicitation and compilation (SPEAC).
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z)
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language (a rough illustrative sketch follows this entry).
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
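The graph view mentioned in the entry above can be pictured with a toy example. The sketch below is an illustration only, not CodeGRAG's construction (which also covers C++): it derives a statement-level graph for a Python snippet with sequential control-flow edges and def-to-use data-flow edges; the `code_graph` helper and its node/edge format are assumptions.

```python
# Rough, hypothetical illustration of a "graphical view" of code built from
# control flow and data flow, using only Python's standard ast module.
import ast

def code_graph(source: str):
    """Statement-level graph: sequential control-flow edges plus
    def->use data-flow edges on variable names."""
    body = ast.parse(source).body
    nodes = [ast.unparse(stmt) for stmt in body]   # readable labels (Python 3.9+)
    edges, last_def = [], {}
    for i, stmt in enumerate(body):
        if i > 0:
            edges.append((i - 1, i, "control"))    # straight-line control flow
        reads = {n.id for n in ast.walk(stmt)
                 if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
        writes = {n.id for n in ast.walk(stmt)
                  if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
        for var in reads:
            if var in last_def:
                edges.append((last_def[var], i, f"data:{var}"))
        for var in writes:
            last_def[var] = i
    return nodes, edges

nodes, edges = code_graph("x = 1\ny = x + 2\nprint(y)")
# edges: [(0, 1, 'control'), (0, 1, 'data:x'), (1, 2, 'control'), (1, 2, 'data:y')]
```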
- SLFNet: Generating Semantic Logic Forms from Natural Language Using Semantic Probability Graphs [6.689539418123863]
Building natural language interfaces typically uses a semantic parser to parse the user's natural language and convert it into structured Semantic Logic Forms (SLFs).
We propose a novel neural network, SLFNet, which incorporates dependency-syntax information as prior knowledge and can capture the long-range interactions between contextual information and words.
Experiments show that SLFNet achieves state-of-the-art performance on the ChineseQCI-TS and Okapi datasets, and competitive performance on the ATIS dataset.
arXiv Detail & Related papers (2024-03-29T02:42:39Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Benchmarking Language Models for Code Syntax Understanding [79.11525961219591]
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
arXiv Detail & Related papers (2022-10-26T04:47:18Z)
- PanGu-Coder: Program Synthesis with Function-Level Language Modeling [47.63943623661298]
PanGu-Coder is a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation.
We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling to pre-train on raw programming language data.
The second stage uses a combination of Causal Language Modelling and Masked Language Modelling to train on loosely curated pairs of natural language program definitions and code functions.
arXiv Detail & Related papers (2022-07-22T18:08:16Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- Multi-modal Program Inference: a Marriage of Pre-trained Language Models and Component-based Synthesis [15.427687814482724]
Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms.
Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description.
We use our combination approach to instantiate multi-modal synthesis systems for two programming domains.
arXiv Detail & Related papers (2021-09-03T16:12:04Z)