Measuring Coding Challenge Competence With APPS
- URL: http://arxiv.org/abs/2105.09938v1
- Date: Thu, 20 May 2021 17:58:42 GMT
- Title: Measuring Coding Challenge Competence With APPS
- Authors: Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt
- Abstract summary: We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
- Score: 54.22600767666257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While programming is one of the most broadly applicable skills in modern
society, modern machine learning models still cannot code solutions to basic
problems. It can be difficult to accurately assess code generation performance,
and there has been surprisingly little work on evaluating code generation in a
way that is both flexible and rigorous. To meet this challenge, we introduce
APPS, a benchmark for code generation. Unlike prior work in more restricted
settings, our benchmark measures the ability of models to take an arbitrary
natural language specification and generate Python code fulfilling this
specification. Similar to how companies assess candidate software developers,
we then evaluate models by checking their generated code on test cases. Our
benchmark includes 10,000 problems, which range from having simple one-line
solutions to being substantial algorithmic challenges. We fine-tune large
language models on both GitHub and our training set, and we find that the
prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as
GPT-Neo can pass approximately 15% of the test cases of introductory problems,
so we find that machine learning models are beginning to learn how to code. As
the social significance of automatic code generation increases over the coming
years, our benchmark can provide an important measure for tracking
advancements.
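The evaluation protocol described above scores a model by executing its generated program on held-out test cases, much like a hiring screen. Below is a minimal sketch of that kind of test-case scoring under assumed conventions (stdin/stdout problems, a hypothetical generated_solution.py file, and a list of input/expected-output pairs); it is illustrative only and is not the official APPS evaluation harness.

```python
# A minimal sketch of test-case evaluation (not the official APPS harness;
# the file name and test-case format below are assumptions): run a generated
# Python solution as a script and compare its stdout to the expected output.
import subprocess
import sys

def run_solution(solution_path: str, stdin_text: str, timeout: float = 4.0) -> str:
    """Execute a candidate solution as a script and capture its stdout."""
    result = subprocess.run(
        [sys.executable, solution_path],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout

def test_case_pass_rate(solution_path: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of (input, expected_output) pairs the solution answers correctly."""
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            out = run_solution(solution_path, stdin_text)
        except (subprocess.TimeoutExpired, OSError):
            continue  # timeouts and crashes count as failed test cases
        if out.strip() == expected.strip():
            passed += 1
    return passed / len(test_cases) if test_cases else 0.0

# Hypothetical usage with two toy I/O test cases:
# print(test_case_pass_rate("generated_solution.py", [("1 2\n", "3\n"), ("10 20\n", "30\n")]))
```

A per-problem pass rate like this corresponds to the "percentage of test cases passed" figure quoted in the abstract; a stricter metric would count a problem as solved only when every test case passes.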
Related papers
- Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation [0.24578723416255752]
We evaluate five different large language models (LLMs) concerning their capabilities for text-to-code generation.
ChatGPT can handle these typical programming challenges by far the most effectively, surpassing even code-specialized models like Code Llama.
arXiv Detail & Related papers (2024-09-06T10:03:49Z) - An Empirical Study on Self-correcting Large Language Models for Data Science Code Generation [1.335664823620186]
Large Language Models (LLMs) have recently advanced many applications on software engineering tasks.
CoT-SelfEvolve iteratively and automatically refines code through a self-correcting process.
arXiv Detail & Related papers (2024-08-28T09:19:09Z) - PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs [1.9207412600219353]
We evaluate two popular benchmarks for Python code generation, analyzing their diversity and difficulty.
Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely.
We propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts on a balanced representation of 38 programming concepts.
arXiv Detail & Related papers (2024-01-08T12:36:43Z) - Exploring Continual Learning for Code Generation Models [80.78036093054855]
Continual Learning (CL) is an important aspect that remains underexplored in the code domain.
We introduce a benchmark called CodeTask-CL that covers a wide range of tasks, including code generation, translation, summarization, and refinement.
We find that effective methods like Prompt Pooling (PP) suffer from catastrophic forgetting due to the unstable training of the prompt selection mechanism.
arXiv Detail & Related papers (2023-07-05T16:58:39Z) - The Good, the Bad, and the Missing: Neural Code Generation for Machine Learning Tasks [11.837851107416588]
This paper investigates the effectiveness of existing neural code generation models on Machine Learning programming tasks.
We select six state-of-the-art neural code generation models, and evaluate their performance on four widely used ML libraries.
Our empirical study reveals some good, bad, and missing aspects of neural code generation models on ML tasks.
arXiv Detail & Related papers (2023-05-16T00:52:02Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z) - Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z) - Automatic Generation of Programming Exercises and Code Explanations with Large Language Models [4.947560475228859]
OpenAI Codex is a recent large language model from the GPT-3 family for translating code into natural language.
We explore the natural language generation capabilities of Codex in two different phases of the life of a programming exercise.
We find the majority of this automatically generated content both novel and sensible, and in many cases ready to use as is.
arXiv Detail & Related papers (2022-06-03T11:00:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.