A Static Evaluation of Code Completion by Large Language Models
- URL: http://arxiv.org/abs/2306.03203v1
- Date: Mon, 5 Jun 2023 19:23:34 GMT
- Title: A Static Evaluation of Code Completion by Large Language Models
- Authors: Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski,
Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia,
Sudipta Sengupta, Dan Roth, Bing Xiang
- Abstract summary: Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees.
- Score: 65.18008807383816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models trained on code have shown great potential to increase
productivity of software developers. Several execution-based benchmarks have
been proposed to evaluate functional correctness of model-generated code on
simple programming problems. Nevertheless, performing the same evaluation on complex real-world projects is expensive because of the execution cost. In contrast, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models. In this work, we propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees. Compared with execution-based evaluation, our method is not only
more efficient, but also applicable to code in the wild. For experiments, we
collect code context from open source repos to generate one million function
bodies using public models. Our static analysis reveals that Undefined Name and Unused Variable are the most common errors made by language models. Through extensive studies, we also show the impact of sampling
temperature, model size, and context on static errors in code completions.
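To make the idea concrete, the following is a minimal sketch of an AST-based check in the spirit of the paper. It is not the authors' framework: the `static_errors` helper and the sample snippet are hypothetical illustrations. It flags syntax errors, names that are read but never bound anywhere in the snippet (a rough proxy for Undefined Name), and variables that are assigned but never read (a rough proxy for Unused Variable), all without executing the code:

```python
import ast
import builtins

def static_errors(code: str) -> list[str]:
    """Approximate static checks over a model-generated Python snippet.

    Illustrative sketch only: it ignores scoping rules, so the results
    are a rough proxy for what a real linter (e.g. pyflakes) reports.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"SyntaxError: {exc.msg} (line {exc.lineno})"]

    bound = set(dir(builtins))   # names defined somewhere in the snippet
    assigned = set()             # plain variables written by the snippet
    loaded = set()               # names read somewhere in the snippet

    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
            else:
                bound.add(node.id)
                assigned.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            bound.update(a.asname or a.name.split(".")[0] for a in node.names)

    errors = [f"Undefined Name: {name}" for name in sorted(loaded - bound)]
    errors += [f"Unused Variable: {name}" for name in sorted(assigned - loaded)]
    return errors

# Example: 'offset' is never defined and 'tmp' is never used.
snippet = """
def total(xs):
    acc = 0
    tmp = 1
    for x in xs:
        acc += x
    return acc + offset
"""
print(static_errors(snippet))  # ['Undefined Name: offset', 'Unused Variable: tmp']
```

A full linter additionally tracks scopes, imports, and many more error classes; the sketch only shows that such checks require no execution, which is what makes them cheap to run over millions of completions.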
Related papers
- RepoMasterEval: Evaluating Code Completion via Real-World Repositories [12.176098357240095]
RepoMasterEval is a novel benchmark for evaluating code completion models constructed from real-world Python and TypeScript repositories.
To improve the test accuracy of model-generated code, we employ mutation testing to measure the effectiveness of the test cases.
Our empirical evaluation on 6 state-of-the-art models shows that test augmentation is critical in improving the accuracy of the benchmark.
arXiv Detail & Related papers (2024-08-07T03:06:57Z)
- NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs.
We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
- Can Large Language Models Write Parallel Code? [0.5317767988097261]
Large language models are increasingly becoming a popular tool for software development.
In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
arXiv Detail & Related papers (2024-01-23T08:25:12Z)
- Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion [15.068025336990287]
We show that existing code completion models do not yield good results on our completion task.
We query a program analyzer for information relevant to a given function call, and consider ways to provide the analyzer results to different code completion models during inference and training.
Our experiments show that providing access to the function implementation and function usages greatly improves the argument completion performance.
arXiv Detail & Related papers (2023-06-01T06:25:58Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- FixEval: Execution-based Evaluation of Program Fixes for Programming Problems [23.987104440395576]
We introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes.
FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes.
Our experiments show that match-based metrics do not reflect model-generated program fixes accurately.
arXiv Detail & Related papers (2022-06-15T20:18:43Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)