DS-1000: A Natural and Reliable Benchmark for Data Science Code
Generation
- URL: http://arxiv.org/abs/2211.11501v1
- Date: Fri, 18 Nov 2022 17:20:27 GMT
- Title: DS-1000: A Natural and Reliable Benchmark for Data Science Code
Generation
- Authors: Yuhang Lai and Chengxi Li and Yiming Wang and Tianyi Zhang and Ruiqi
Zhong and Luke Zettlemoyer and Scott Wen-tau Yih and Daniel Fried and Sida
Wang and Tao Yu
- Abstract summary: DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DS-1000, a code generation benchmark with a thousand data
science problems spanning seven Python libraries, such as NumPy and Pandas.
Compared to prior works, DS-1000 incorporates three core features. First, our
problems reflect diverse, realistic, and practical use cases since we collected
them from StackOverflow. Second, our automatic evaluation is highly specific
(reliable) -- across all Codex-002-predicted solutions that our evaluation
accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria
metrics, checking both functional correctness by running test cases and
surface-form constraints by restricting API usages or keywords. Finally, we
proactively defend against memorization by slightly modifying our problems to
be different from the original StackOverflow source; consequently, models
cannot answer them correctly by memorizing the solutions from pre-training. The
current best public system (Codex-002) achieves 43.3% accuracy, leaving ample
room for improvement. We release our benchmark at
https://ds1000-code-gen.github.io.
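The multi-criteria evaluation described in the abstract -- functional correctness via test cases plus surface-form constraints on API usage -- can be sketched as follows. This is a minimal illustration, not DS-1000's actual harness: the `solve` convention, the sample problem, and the `np.where` requirement are all hypothetical stand-ins.

```python
import numpy as np

def functional_check(solution_code, test_inputs, expected):
    """Execute the candidate code and compare its outputs to references."""
    namespace = {"np": np}
    exec(solution_code, namespace)  # assumed to define a `solve` function
    return all(
        np.array_equal(namespace["solve"](x), want)
        for x, want in zip(test_inputs, expected)
    )

def surface_form_check(solution_code, required=(), forbidden=()):
    """Check surface-form constraints on the raw source text."""
    return (all(api in solution_code for api in required)
            and not any(api in solution_code for api in forbidden))

def evaluate(solution_code, test_inputs, expected, required=(), forbidden=()):
    """Accept a prediction only if BOTH criteria pass."""
    return (surface_form_check(solution_code, required, forbidden)
            and functional_check(solution_code, test_inputs, expected))

# Illustrative problem: replace negative entries with zero; the hypothetical
# surface-form constraint insists on a vectorized `np.where` solution.
candidate = (
    "def solve(a):\n"
    "    return np.where(a < 0, 0, a)\n"
)
inputs = [np.array([-1, 2, -3])]
expected = [np.array([0, 2, 0])]
print(evaluate(candidate, inputs, expected, required=("np.where",)))  # True
```

A loop-based solution that returns the right values would still be rejected here by the surface-form check, which is how this kind of metric stays specific: it rules out answers that are functionally plausible but violate the problem's stated API constraints.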
Related papers
- CodeContests-O: Powering LLMs via Feedback-Driven Iterative Test Case Generation [71.42965967582147]
Existing approaches attempt to synthesize test cases using Large Language Models (LLMs). We propose a Feedback-Bench Iterative Framework for comprehensive test case construction. Our dataset achieves an average True Positive Rate (TPR) of 89.37% and True Negative Rate (TNR) of 90.89%, significantly outperforming CodeContests and CodeContests+ by margins of 4.32% and 9.37%, respectively.
arXiv Detail & Related papers (2026-01-20T07:32:44Z)
- QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback [7.355017519768158]
We introduce QCoder Benchmark, an evaluation framework that assesses large language models (LLMs) on quantum programming. Our benchmark supports evaluation using a quantum simulator environment beyond conventional Python execution. Even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code.
arXiv Detail & Related papers (2025-10-30T03:27:35Z)
- How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective [51.30005925128432]
Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix.
arXiv Detail & Related papers (2025-10-09T18:29:24Z)
- UQ: Assessing Language Models on Unsolved Questions [149.46593270027697]
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers.
arXiv Detail & Related papers (2025-08-25T01:07:59Z)
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [18.886738819470086]
We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories.
arXiv Detail & Related papers (2025-07-14T17:19:00Z)
- DSCodeBench: A Realistic Benchmark for Data Science Code Generation [16.227266086218425]
DSCodeBench is a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. It consists of 1,000 carefully constructed problems sourced from GitHub across ten widely used Python data science libraries. Compared to the current state-of-the-art benchmark DS-1000, DSCodeBench offers a more challenging and representative testbed.
arXiv Detail & Related papers (2025-05-21T15:11:26Z)
- GENCNIPPET: Automated Generation of Code Snippets for Supporting Programming Questions [5.176434782905268]
Software developers often ask questions on Technical Q&A forums like Stack Overflow (SO) to seek solutions to their programming-related problems.
Many questions miss required code snippets due to the lack of readily available code, time constraints, employer restrictions, confidentiality concerns, or uncertainty about what code to share.
GENCNIPPET will generate relevant code examples (when required) to support such questions and enable timely solutions.
arXiv Detail & Related papers (2025-04-22T22:07:40Z)
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory [52.44029486173232]
Dynamic Cheatsheet (DC) is a lightweight framework that endows a black-box language model with a persistent, evolving memory.
DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time.
This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback.
arXiv Detail & Related papers (2025-04-10T17:57:33Z)
- Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [9.48622608877252]
Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1 in solving Python coding problems.
REPOCOD is a code generation benchmark with 980 problems collected from 11 popular real-world projects.
Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation.
arXiv Detail & Related papers (2024-10-29T01:21:05Z)
- Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages.
We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
- Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons [5.564793925574796]
We present an approach to automated debugging using large, pretrained transformers.
We start by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs.
Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions that are covered by passing tests.
arXiv Detail & Related papers (2021-05-19T18:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.