DS-1000: A Natural and Reliable Benchmark for Data Science Code
Generation
- URL: http://arxiv.org/abs/2211.11501v1
- Date: Fri, 18 Nov 2022 17:20:27 GMT
- Title: DS-1000: A Natural and Reliable Benchmark for Data Science Code
Generation
- Authors: Yuhang Lai and Chengxi Li and Yiming Wang and Tianyi Zhang and Ruiqi
Zhong and Luke Zettlemoyer and Scott Wen-tau Yih and Daniel Fried and Sida
Wang and Tao Yu
- Abstract summary: DS-1000 is a code generation benchmark with a thousand data science problems spanning seven Python libraries.
First, our problems reflect diverse, realistic, and practical use cases since we collected them from StackOverflow.
Second, our automatic evaluation is highly specific (reliable) -- across all Codex-predicted solutions that our evaluation accepts, only 1.8% are incorrect.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DS-1000, a code generation benchmark with a thousand data
science problems spanning seven Python libraries, such as NumPy and Pandas.
Compared to prior works, DS-1000 incorporates three core features. First, our
problems reflect diverse, realistic, and practical use cases since we collected
them from StackOverflow. Second, our automatic evaluation is highly specific
(reliable) -- across all Codex-002-predicted solutions that our evaluation
accepts, only 1.8% of them are incorrect; we achieve this with multi-criteria
metrics, checking both functional correctness by running test cases and
surface-form constraints by restricting API usages or keywords. Finally, we
proactively defend against memorization by slightly modifying our problems to
be different from the original StackOverflow source; consequently, models
cannot answer them correctly by memorizing the solutions from pre-training. The
current best public system (Codex-002) achieves 43.3% accuracy, leaving ample
room for improvement. We release our benchmark at
https://ds1000-code-gen.github.io.
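The multi-criteria evaluation described in the abstract -- functional correctness via test cases plus surface-form constraints on API usage -- can be sketched as follows. This is a minimal illustration, not DS-1000's actual harness: the `solve` convention, the sample problem, and the `np.where` requirement are all hypothetical stand-ins.

```python
import numpy as np

def functional_check(solution_code, test_inputs, expected):
    """Execute the candidate code and compare its outputs to references."""
    namespace = {"np": np}
    exec(solution_code, namespace)  # assumed to define a `solve` function
    return all(
        np.array_equal(namespace["solve"](x), want)
        for x, want in zip(test_inputs, expected)
    )

def surface_form_check(solution_code, required=(), forbidden=()):
    """Check surface-form constraints on the raw source text."""
    return (all(api in solution_code for api in required)
            and not any(api in solution_code for api in forbidden))

def evaluate(solution_code, test_inputs, expected, required=(), forbidden=()):
    """Accept a prediction only if BOTH criteria pass."""
    return (surface_form_check(solution_code, required, forbidden)
            and functional_check(solution_code, test_inputs, expected))

# Illustrative problem: replace negative entries with zero; the hypothetical
# surface-form constraint insists on a vectorized `np.where` solution.
candidate = (
    "def solve(a):\n"
    "    return np.where(a < 0, 0, a)\n"
)
inputs = [np.array([-1, 2, -3])]
expected = [np.array([0, 2, 0])]
print(evaluate(candidate, inputs, expected, required=("np.where",)))  # True
```

A loop-based solution that returns the right values would still be rejected here by the surface-form check, which is how this kind of metric stays specific: it rules out answers that are functionally plausible but violate the problem's stated API constraints.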
Related papers
- CodeContests-O: Powering LLMs via Feedback-Driven Iterative Test Case Generation [71.42965967582147]
Existing approaches attempt to synthesize test cases using Large Language Models (LLMs). We propose a Feedback-Bench Iterative Framework for comprehensive test case construction. Our dataset achieves an average True Positive Rate (TPR) of 89.37% and True Negative Rate (TNR) of 90.89%, significantly outperforming CodeContests and CodeContests+ by margins of 4.32% and 9.37%, respectively.
arXiv Detail & Related papers (2026-01-20T07:32:44Z)
- QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback [7.355017519768158]
We introduce QCoder Benchmark, an evaluation framework that assesses large language models (LLMs) on quantum programming. Our benchmark supports evaluation using a quantum simulator environment beyond conventional Python execution. Even advanced models like GPT-4o achieve only around 18.97% accuracy, highlighting the difficulty of the benchmark. In contrast, reasoning-based models such as o3 reach up to 78% accuracy, outperforming the average success rate of human-written code.
arXiv Detail & Related papers (2025-10-30T03:27:35Z)
- How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective [51.30005925128432]
Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks suffer from high computational costs, score inflation, and a bias towards trivial bugs over rare, critical faults. We introduce a framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix.
arXiv Detail & Related papers (2025-10-09T18:29:24Z)
- UQ: Assessing Language Models on Unsolved Questions [149.46593270027697]
We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers.
arXiv Detail & Related papers (2025-08-25T01:07:59Z)
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance [18.886738819470086]
We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories.
arXiv Detail & Related papers (2025-07-14T17:19:00Z)
- DSCodeBench: A Realistic Benchmark for Data Science Code Generation [16.227266086218425]
DSCodeBench is a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. It consists of 1,000 carefully constructed problems sourced from GitHub across ten widely used Python data science libraries. Compared to the current state-of-the-art benchmark DS-1000, DSCodeBench offers a more challenging and representative testbed.
arXiv Detail & Related papers (2025-05-21T15:11:26Z)
- GENCNIPPET: Automated Generation of Code Snippets for Supporting Programming Questions [5.176434782905268]
Software developers often ask questions on Technical Q&A forums like Stack Overflow (SO) to seek solutions to their programming-related problems.
Many questions miss required code snippets due to the lack of readily available code, time constraints, employer restrictions, confidentiality concerns, or uncertainty about what code to share.
GENCNIPPET will generate relevant code examples (when required) to support such questions and enable timely solutions.
arXiv Detail & Related papers (2025-04-22T22:07:40Z)
- Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory [52.44029486173232]
Dynamic Cheatsheet (DC) is a lightweight framework that endows a black-box language model with a persistent, evolving memory.
DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time.
This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback.
arXiv Detail & Related papers (2025-04-10T17:57:33Z)
- Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' [9.48622608877252]
Large language models (LLMs) have achieved high accuracy, i.e., more than 90% pass@1 in solving Python coding problems.
REPOCOD is a code generation benchmark with 980 problems collected from 11 popular real-world projects.
Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation.
arXiv Detail & Related papers (2024-10-29T01:21:05Z)
- Generating Unseen Code Tests In Infinitum [1.0674604700001968]
We present a method for creating benchmark variations that generalize across coding tasks and programming languages.
We implement one benchmark, called auto-regression, for the task of text-to-code generation in Python.
arXiv Detail & Related papers (2024-07-29T08:11:20Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
- Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection [58.789823426981044]
We propose a novel auxiliary loss formulation that aims to align the class confidence of bounding boxes with the accuracy of predictions.
Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios.
arXiv Detail & Related papers (2023-03-25T08:56:21Z)
- Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
- DeepDebug: Fixing Python Bugs Using Stack Traces, Backtranslation, and Code Skeletons [5.564793925574796]
We present an approach to automated debugging using large, pretrained transformers.
We start by training a bug-creation model on reversed commit data for the purpose of generating synthetic bugs.
Next, we focus on 10K repositories for which we can execute tests, and create buggy versions of all functions that are covered by passing tests.
arXiv Detail & Related papers (2021-05-19T18:40:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.