The Good, the Bad, and the Missing: Neural Code Generation for Machine
Learning Tasks
- URL: http://arxiv.org/abs/2305.09082v1
- Date: Tue, 16 May 2023 00:52:02 GMT
- Title: The Good, the Bad, and the Missing: Neural Code Generation for Machine
Learning Tasks
- Authors: Jiho Shin, Moshi Wei, Junjie Wang, Lin Shi, Song Wang
- Abstract summary: This paper investigates the effectiveness of existing neural code generation models on Machine Learning programming tasks.
We select six state-of-the-art neural code generation models, and evaluate their performance on four widely used ML libraries.
Our empirical study reveals some good, bad, and missing aspects of neural code generation models on ML tasks.
- Score: 11.837851107416588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) has been increasingly used in a variety of
domains, yet solving ML programming tasks poses unique challenges, especially
for developers without ML backgrounds, because such tasks differ fundamentally
in nature and construction from general programming tasks. Automatic code
generation that produces a code snippet from a natural language description can
be a promising technique to accelerate ML programming tasks. In recent years,
although many deep learning-based neural code generation models have been
proposed with high accuracy, the fact that most of them are mainly evaluated on
general programming tasks calls into question their effectiveness and
usefulness in ML programming tasks. In this paper, we set out to investigate
the effectiveness of existing neural code generation models on ML programming
tasks. For our analysis, we select six state-of-the-art neural code generation
models and evaluate their performance on four widely used ML libraries, using a
newly created set of 83K pairs of natural-language-described ML programming tasks. Our
empirical study reveals some good, bad, and missing aspects of neural code
generation models on ML tasks, with a few major ones listed below. (Good)
Neural code generation models perform significantly better on ML tasks than on
non-ML tasks. (Bad) Most of the generated code is semantically incorrect. (Bad)
Code generation models cannot significantly improve developers' completion
time. (Good) The generated code can help developers write more correct code by
providing clues about which APIs to use. (Missing) Observations from our user
study reveal missing aspects of code generation for ML tasks, e.g., decomposing
code generation, in a divide-and-conquer fashion, into two sub-tasks: API
sequence identification and API usage generation.
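As a rough illustration of that proposed decomposition, the sketch below first names a candidate API sequence for a natural-language ML task and then expands it into usage code. The task wording, the choice of scikit-learn, and all hyperparameters are illustrative assumptions, not artifacts of the paper's dataset or models.

```python
# Hypothetical two-stage sketch: (1) API sequence identification,
# (2) API usage generation. Task text, APIs, and parameters are assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

task = "Train a random forest on the iris data and report test accuracy."

# Stage 1: identify the API sequence implied by the task description.
api_sequence = [
    "sklearn.datasets.load_iris",
    "sklearn.model_selection.train_test_split",
    "sklearn.ensemble.RandomForestClassifier",
    "sklearn.metrics.accuracy_score",
]

# Stage 2: generate concrete usage code for that sequence.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```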
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets, covering both C++ and Python, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
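As a loose, generic illustration of a graph view over a code block (not CodeGRAG's actual meta-graph construction), the sketch below uses Python's ast module to collect simple definition-use edges; a real system would also add control-flow edges and encode the graph with the GNN expert.

```python
# Generic sketch of a data-flow-style graph over a code block, built with
# Python's ast module. This is NOT CodeGRAG's meta-graph; it only records
# definition-use edges (a stored variable later being read).
import ast

code = """
total = 0
for x in values:
    total = total + x
mean = total / len(values)
"""

tree = ast.parse(code)
defs, uses = {}, []
for node in ast.walk(tree):
    if isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            defs.setdefault(node.id, []).append(node.lineno)
        else:
            uses.append((node.id, node.lineno))

# One edge (variable, def_line, use_line) per stored variable that is later read.
edges = [(var, d, u) for var, u in uses for d in defs.get(var, []) if d <= u]
print(edges)
```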
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation [32.178931149612644]
Large Language Models (LLMs) have already gained widespread adoption in software engineering, particularly in code generation tasks.
We propose MENT, a novel and effective model editing approach to repair LLMs in coding tasks.
MENT is effective, efficient, and reliable, capable of correcting a neural model by patching just one or two neurons.
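To make "patching a neuron" concrete, the sketch below edits the outgoing weights of one hidden unit in a toy PyTorch MLP. The patch values are hand-picked placeholders; this is not MENT's repair procedure, only a generic picture of what editing a single neuron can look like.

```python
# Generic illustration of a neuron-level edit (NOT MENT's actual method):
# overwrite the outgoing weights of one hidden neuron in a small MLP.
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

neuron = 3  # index of the hidden unit to "patch"
with torch.no_grad():
    # Hand-picked placeholder values; a real repair method would compute
    # the edit rather than guess it.
    mlp[2].weight[:, neuron] = torch.tensor([0.5, -0.5, 0.0, 0.0])

print(mlp(torch.randn(1, 8)))
```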
arXiv Detail & Related papers (2023-12-08T20:28:08Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making code more structured and readable improves the system's code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
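A minimal sketch of such a cleaning step, assuming a hypothetical call_llm client and a hand-written prompt (neither is from the paper), might look like this:

```python
# Hypothetical LLM-based cleaning step: ask a model to rewrite a program
# into a more structured, readable form with identical behavior.
# `call_llm` is a stand-in, not a real API from the paper.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your own model client here")

CLEANING_PROMPT = (
    "Rewrite the following program so it is easier to read: use helper "
    "functions, descriptive names, and comments, but keep the input/output "
    "behavior exactly the same.\n\n{code}"
)

def clean_program(code: str) -> str:
    return call_llm(CLEANING_PROMPT.format(code=code))
```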
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
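Confidence calibration is often summarized with expected calibration error; the sketch below is a generic ECE computation over (confidence, correctness) pairs, not L2CEval's exact measurement protocol.

```python
# Generic expected calibration error (ECE) over (confidence, correct) pairs;
# a stand-in for "measuring confidence calibration", not L2CEval's protocol.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.3], [1, 1, 0, 0]))
```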
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Do Large Language Models Pay Similar Attention Like Human Programmers When Generating Code? [10.249771123421432]
We investigate whether Large Language Models (LLMs) attend to the same parts of a task description as human programmers during code generation.
We manually analyzed 211 incorrect code snippets and found five attention patterns that can be used to explain many code generation errors.
Our findings highlight the need for human-aligned LLMs for better interpretability and programmer trust.
arXiv Detail & Related papers (2023-06-02T00:57:03Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder-based and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
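A prefix-LM attends bidirectionally within the prefix and causally over the continuation; the minimal mask sketch below shows that idea and is not CodeGen2's actual implementation.

```python
# Minimal prefix-LM attention mask: full attention within the prefix,
# causal attention for the generated suffix. Illustrative only.
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # causal base
    mask[:, :prefix_len] = True  # every position may attend to the prefix
    return mask

print(prefix_lm_mask(prefix_len=3, total_len=6).astype(int))
```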
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- NatGen: Generative pre-training by "Naturalizing" source code [18.410818213965918]
We propose a new pre-training objective: "naturalizing" source code.
Unlike natural language, code's bimodal, dual-channel nature allows us to generate semantically equivalent code at scale.
We fine-tune our model on three generative software engineering tasks, achieving state-of-the-art performance rivaling CodeT5.
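One way to produce a semantically equivalent but less idiomatic variant, which a model could then learn to rewrite back into natural form, is a simple AST transform like the sketch below; the specific rewrite is an illustrative assumption, not one of NatGen's documented transformations.

```python
# Sketch of one semantics-preserving rewrite (an assumption, not NatGen's
# transform set): expand augmented assignments such as `x += 1` into the
# equivalent `x = x + 1`. Requires Python 3.9+ for ast.unparse.
import ast

class ExpandAugAssign(ast.NodeTransformer):
    def visit_AugAssign(self, node):
        # Only handles simple variable targets like `total`.
        return ast.copy_location(
            ast.Assign(
                targets=[node.target],
                value=ast.BinOp(
                    left=ast.Name(id=node.target.id, ctx=ast.Load()),
                    op=node.op,
                    right=node.value,
                ),
            ),
            node,
        )

tree = ExpandAugAssign().visit(ast.parse("total += price * qty"))
print(ast.unparse(ast.fix_missing_locations(tree)))  # total = total + price * qty
```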
arXiv Detail & Related papers (2022-06-15T15:08:29Z)
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
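The general recipe can be sketched as reranking sampled candidates by a learned correctness score and returning the top one; the toy scoring function below is a placeholder, not the paper's trained ranker.

```python
# Generic reranking sketch (not the paper's ranker): score each sampled
# program with a correctness predictor and return the highest-scoring one.
from typing import Callable, List

def rerank(candidates: List[str], score_fn: Callable[[str], float]) -> str:
    return max(candidates, key=score_fn)

# Placeholder score; a fault-aware ranker would be a trained model that
# predicts correctness without executing the code.
def toy_score(code: str) -> float:
    return -code.count("TODO")

print(rerank(["def f(x): return x  # TODO", "def f(x): return x + 1"], toy_score))
```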
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
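A simplified view of test-case-based scoring, not the APPS harness itself, is to run a candidate solution on each test case and report the fraction it passes:

```python
# Simplified pass-rate check in the spirit of such benchmarks (not the
# actual APPS evaluation harness): run a candidate on each test case and
# count how many it gets right; runtime errors count as failures.
def pass_rate(solution, test_cases):
    passed = 0
    for args, expected in test_cases:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass
    return passed / len(test_cases)

candidate = lambda a, b: a + b  # a sampled "solution"
tests = [((1, 2), 3), ((0, 0), 0), ((2, 2), 5)]  # last case is intentionally wrong
print(pass_rate(candidate, tests))  # ~0.67
```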
arXiv Detail & Related papers (2021-05-20T17:58:42Z)