DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
- URL: http://arxiv.org/abs/2211.11501v1
- Date: Fri, 18 Nov 2022 17:20:27 GMT
- Title: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
- Authors: Yuhang Lai and Chengxi Li and Yiming Wang and Tianyi Zhang and Ruiqi
Zhong and Luke Zettlemoyer and Scott Wen-tau Yih and Daniel Fried and Sida
Wang and Tao Yu
- Abstract summary: DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
- Score: 70.96868419971756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DS-1000, a code generation benchmark with a thousand data
science problems spanning seven Python libraries, such as NumPy and Pandas.
Compared to prior works, DS-1000 incorporates three core features. First, our
problems reflect diverse, realistic, and practical use cases since we collected
them from StackOverflow. Second, our automatic evaluation is highly specific
(reliable) -- across all Codex-002-predicted solutions that our evaluation
accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria
metrics, checking both functional correctness by running test cases and
surface-form constraints by restricting API usage or keywords. Finally, we
proactively defend against memorization by slightly modifying our problems to
be different from the original StackOverflow source; consequently, models
cannot answer them correctly by memorizing the solutions from pre-training. The
current best public system (Codex-002) achieves 43.3% accuracy, leaving ample
room for improvement. We release our benchmark at
https://ds1000-code-gen.github.io.
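The evaluation described above combines two checks: functional correctness, verified by executing hidden test cases, and surface-form constraints, verified by inspecting the generated code for banned APIs or required keywords. The snippet below is a minimal sketch of that idea, not the official DS-1000 harness; the function name `evaluate`, the field layout, the toy Pandas problem, and the unsandboxed `exec`-based execution are all illustrative assumptions.

```python
# Minimal sketch of a DS-1000-style multi-criteria check (illustrative only;
# the real benchmark's harness, problem format, and constraints differ).
import re


def evaluate(prediction: str, test_program: str, banned_apis=(), required_keywords=()):
    """Accept a predicted solution only if it satisfies simple surface-form
    constraints and passes the assertion-based functional test."""
    # Surface-form constraints: e.g., forbid a trivial API or require a keyword.
    for api in banned_apis:
        if re.search(re.escape(api), prediction):
            return False
    for kw in required_keywords:
        if kw not in prediction:
            return False

    # Functional correctness: run the prediction followed by the test program
    # in a shared namespace (sandboxing omitted for brevity).
    namespace = {}
    try:
        exec(prediction + "\n" + test_program, namespace)
    except Exception:
        return False
    return True


# Hypothetical usage: a small Pandas problem whose test asserts on the result.
prediction = (
    "import pandas as pd\n"
    "df = pd.DataFrame({'a': [1, 2, 3]})\n"
    "result = df['a'].sum()"
)
test_program = "assert result == 6"
print(evaluate(prediction, test_program, banned_apis=("numpy.sum",)))  # True if pandas is installed
```

In the actual benchmark, execution is isolated and the surface-form constraints are problem-specific; the sketch only illustrates why combining both checks makes acceptance more specific than running test cases alone.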
Related papers
- Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [9.48622608877252]
Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1, in solving Python coding problems.
REPOCOD is a code generation benchmark with 980 problems collected from 11 popular real-world projects.
Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation.
arXiv Detail & Related papers (2024-10-29T01:21:05Z)
- Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages.
We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in-domain and out-of-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
- Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons [5.564793925574796]
We present an approach to automated debugging using large, pretrained transformers.
We start by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs.
Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions that are covered by passing tests.
arXiv Detail & Related papers (2021-05-19T18:40:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.