CodeFusion: A Pre-trained Diffusion Model for Code Generation
- URL: http://arxiv.org/abs/2310.17680v3
- Date: Wed, 1 Nov 2023 17:30:47 GMT
- Title: CodeFusion: A Pre-trained Diffusion Model for Code Generation
- Authors: Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu,
Gust Verbruggen
- Abstract summary: Auto-regressive models for code generation from natural language do not easily allow reconsidering tokens generated earlier.
We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language.
Experiments show that CodeFusion performs on par with state-of-the-art auto-regressive systems.
- Score: 17.187094058627615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Imagine a developer who can only change their last line of code: how often
would they have to start writing a function from scratch before it is correct?
Auto-regressive models for code generation from natural language have a similar
limitation: they do not easily allow reconsidering tokens generated earlier. We
introduce CodeFusion, a pre-trained diffusion code generation model that
addresses this limitation by iteratively denoising a complete program
conditioned on the encoded natural language. We evaluate CodeFusion on the task
of natural language to code generation for Bash, Python, and Microsoft Excel
conditional formatting (CF) rules. Experiments show that CodeFusion (75M
parameters) performs on par with state-of-the-art auto-regressive systems
(350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and
top-5 accuracy due to its better balance in diversity versus quality.
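The core loop is easy to picture. Below is a minimal sketch of diffusion-style generation as the abstract describes it: start from noise and iteratively denoise the entire program conditioned on the encoded prompt. The `encode_nl` and `denoise_step` callables (and all shapes) are illustrative assumptions, not CodeFusion's actual interface.

```python
# Minimal sketch of diffusion-style code generation (hypothetical modules).
# `encode_nl` and `denoise_step` stand in for an NL encoder and a denoiser;
# they are assumptions for illustration, not the paper's actual API.
import torch

def generate(nl_prompt, encode_nl, denoise_step, seq_len=128, dim=512, steps=50):
    nl_embedding = encode_nl(nl_prompt)       # condition on the natural language
    x = torch.randn(seq_len, dim)             # start from pure Gaussian noise
    for t in reversed(range(steps)):          # iteratively denoise the WHOLE program,
        x = denoise_step(x, t, nl_embedding)  # so every position can be revised each step
    return x                                  # continuous embeddings, decoded to tokens afterwards
```

Unlike left-to-right decoding, every refinement step can revise any position in the program, which is the property the abstract contrasts with auto-regressive generation.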
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
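For the lexical-unit decoding entry above, the following hedged sketch shows only the general idea of emitting several tokens per forward pass and keeping a confident prefix; the paper's actual data-driven procedure differs, and `model.propose` is a hypothetical helper.

```python
# Hedged sketch of parallel multi-token decoding: propose several next tokens
# in one pass and accept the longest high-confidence prefix. Illustrative only;
# `model.propose` is a hypothetical helper, not the paper's interface.
def decode(model, prompt_ids, max_len=256, unit=4, tau=0.9):
    out = list(prompt_ids)
    while len(out) < max_len:
        tokens, probs = model.propose(out, n=unit)  # n tokens from one forward pass
        accepted = 0
        for tok, p in zip(tokens, probs):
            if p < tau:                             # stop at first low-confidence token
                break
            out.append(tok)
            accepted += 1
        if accepted == 0:                           # fall back to one token per step
            out.append(tokens[0])
    return out
```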
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
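As a concrete illustration of the mutation-based data augmentation mentioned above, the sketch below flips a comparison operator with Python's `ast` module to produce a new runnable variant whose execution trace can serve as training data; CodeExecutor's actual mutation set is broader.

```python
# Hedged sketch of mutation-based augmentation: perturb an operator in a
# program to yield a new runnable variant. One toy mutation shown.
import ast

class OperatorMutator(ast.NodeTransformer):
    def visit_Compare(self, node):
        self.generic_visit(node)
        # flip `<` to `<=` as one simple mutation
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node

src = "def f(x):\n    return x < 10\n"
tree = OperatorMutator().visit(ast.parse(src))
ast.fix_missing_locations(tree)
print(ast.unparse(tree))  # the mutated program now returns x <= 10
```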
- CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code [75.08995072899594]
We propose CodeBERTScore: an evaluation metric for code generation.
CodeBERTScore encodes the natural language input preceding the generated code.
We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics.
arXiv Detail & Related papers (2023-02-10T22:12:05Z)
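A BERTScore-style metric like the one above boils down to greedy cosine matching between contextual token embeddings. The sketch below assumes the embeddings are already computed (placeholder numpy arrays); real CodeBERTScore additionally encodes the natural language context with a code-pretrained encoder.

```python
# Hedged sketch of a BERTScore-style F1 between candidate and reference code,
# given precomputed contextual token embeddings (rows = tokens).
import numpy as np

def soft_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # pairwise cosine similarities
    precision = sim.max(axis=1).mean()  # best match for each candidate token
    recall = sim.max(axis=0).mean()     # best match for each reference token
    return 2 * precision * recall / (precision + recall)
```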
- Syntax-Aware On-the-Fly Code Completion [13.268277642411974]
We propose PyCoder to leverage token types, a kind of lightweight syntactic information.
Our PyCoder achieves the first rank on the CodeXGLUE leaderboard with an accuracy of 77.12% for the token-level predictions.
arXiv Detail & Related papers (2022-11-09T04:24:18Z)
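The "lightweight syntactic information" in the PyCoder entry above is of the token-type kind that Python's standard-library tokenizer already exposes. A minimal sketch of extracting those types (how the pairs are fed to the model is not shown):

```python
# Sketch: pair each code token with its token type using the stdlib tokenizer,
# the kind of lightweight syntactic signal PyCoder leverages.
import io
import tokenize
from token import tok_name

src = "def add(a, b):\n    return a + b\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.string.strip():
        print(f"{tok.string!r:12} -> {tok_name[tok.type]}")
# 'def' -> NAME, 'add' -> NAME, '(' -> OP, ... usable as an auxiliary input
```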
- Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z)
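TiCoder's workflow above can be pictured as a pruning loop: propose tests, ask the user to accept or reject each, and discard code suggestions inconsistent with the accepted tests. Everything named below (`gen_tests`, `gen_code`, `passes`, `ask_user`) is a hypothetical helper, not TiCoder's API.

```python
# Hedged sketch of a test-driven intent-formalization loop in the spirit of
# TiCoder; all helpers are hypothetical stand-ins.
def ticoder_loop(nl_intent, gen_tests, gen_code, passes, ask_user, rounds=3):
    candidates = gen_code(nl_intent)        # initial code suggestions
    approved = []
    for test in gen_tests(nl_intent)[:rounds]:
        if ask_user(test):                  # user confirms the intended behavior
            approved.append(test)
            candidates = [c for c in candidates if passes(c, test)]
    return candidates, approved             # pruned code plus formalized intent
```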
- NatGen: Generative pre-training by "Naturalizing" source code [18.410818213965918]
We propose a new pre-training objective, "Naturalizing" of source code.
Unlike natural language, code has a bimodal, dual-channel nature that allows us to generate semantically equivalent code at scale.
We fine-tune our model in three generative Software Engineering tasks to achieve state-of-the-art performance rivaling CodeT5.
arXiv Detail & Related papers (2022-06-15T15:08:29Z)
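To make the "naturalizing" objective above concrete: mechanically rewrite code into a semantically equivalent but less idiomatic form, then train the model to recover the natural original. One toy transform (NatGen uses several kinds) is sketched below.

```python
# Hedged sketch of a semantics-preserving "de-naturalizing" transform used to
# build training pairs; the model learns to invert it. Toy example only.
def denaturalize(line: str) -> str:
    # rewrite `x += 1` into the equivalent but less idiomatic `x = x + 1`
    indent = line[: len(line) - len(line.lstrip())]
    if "+=" in line:
        var, rhs = line.split("+=", 1)
        return f"{indent}{var.strip()} = {var.strip()} + {rhs.strip()}"
    return line

natural = "    count += 1"
unnatural = denaturalize(natural)   # "    count = count + 1"
# training pair: the model sees `unnatural` and learns to emit `natural`
```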
- Natural Language to Code Translation with Execution [82.52142893010563]
We introduce execution result-based minimum Bayes risk decoding for program selection.
We show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks.
arXiv Detail & Related papers (2022-04-25T06:06:08Z)
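The execution-based minimum Bayes risk idea above can be sketched in a few lines: execute each sampled program on shared inputs and return the one whose outputs agree with the most other samples. `run` stands in for a sandboxed executor and is an assumption.

```python
# Hedged sketch of execution-based minimum Bayes risk program selection.
# `run(program, x)` is a hypothetical sandboxed executor.
def mbr_exec(programs, test_inputs, run):
    def outputs(prog):
        return tuple(run(prog, x) for x in test_inputs)
    sigs = [outputs(p) for p in programs]
    # score each sample by how many samples produce identical outputs
    scores = [sum(s == t for t in sigs) for s in sigs]
    return programs[max(range(len(programs)), key=scores.__getitem__)]
```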
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
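Zero-shot infilling with a left-to-right model works by moving the masked span to the end of the sequence with sentinel tokens. The sketch below follows that spirit; the sentinel spelling and the `model.generate` call are illustrative assumptions rather than InCoder's exact interface.

```python
# Hedged sketch of infilling with a causal LM and sentinel tokens.
# The sentinel string and `model.generate` are assumptions for illustration.
def infill(model, prefix: str, suffix: str) -> str:
    # move the masked region to the end so a left-to-right model can fill it
    prompt = prefix + "<|mask:0|>" + suffix + "<|mask:0|>"
    completion = model.generate(prompt)   # generates only the missing span
    return prefix + completion + suffix

# usage: infill(model, "def mean(xs):\n    return ", " / len(xs)\n")
```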
- Automatic Code Generation using Pre-Trained Language Models [0.0]
We propose an end-to-end machine learning model for code generation in the Python language built on top of pre-trained language models.
We demonstrate that a fine-tuned model can perform well in code generation tasks, achieving a BLEU score of 0.22, an improvement of 46% over a reasonable sequence-to-sequence baseline.
arXiv Detail & Related papers (2021-02-21T07:21:26Z)
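For context on the BLEU figure in the last entry, this is roughly how a sentence-level BLEU score for generated code is computed with NLTK; the paper's exact tokenization and smoothing settings are not given here, so treat this as a sketch.

```python
# Hedged sketch of sentence-level BLEU for generated code using NLTK;
# tokenization and smoothing choices here are assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "def add ( a , b ) : return a + b".split()
candidate = "def add ( a , b ) : return b + a".split()
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.2f}")
```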