NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
- URL: http://arxiv.org/abs/2405.04520v1
- Date: Tue, 7 May 2024 17:52:51 GMT
- Title: NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts
- Authors: Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang
- Abstract summary: We propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks.
NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries on online coding services.
Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores can still be significant.
- Score: 31.783388267874738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated a strong ability to generate code for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented toward introductory algorithm and data-science tasks and insufficiently reflect the challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries on online coding services and covering 6 different domains. Noting the extraordinary difficulty of creating test cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test-case construction; compared with fully manual construction, it achieves a more than 4-fold efficiency increase. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores can still be significant, indicating a lack of focus on practical code-synthesis scenarios or over-specialized optimization for HumanEval. On the other hand, even the best-performing GPT-4 still falls far short of satisfactory performance on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench.
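For intuition about what evaluating against such a benchmark involves, here is a minimal sketch of a test-driven scoring harness: it runs each model-generated Python solution against its problem's test file and reports the fraction of problems solved. The file layout, function names, and pytest invocation are assumptions for illustration, not the actual NCB toolkit API.

```python
# Minimal sketch of scoring model-generated solutions against per-problem
# test cases. Names and layout are hypothetical, not the NCB toolkit API.
import subprocess
import tempfile
from pathlib import Path

def run_solution(solution_code: str, test_code: str, timeout: int = 10) -> bool:
    """Write the candidate solution and its tests to a temp dir and run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(solution_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "test_solution.py", "-q"],
                cwd=tmp, capture_output=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

def pass_rate(problems: list[dict]) -> float:
    """problems: [{'solution': <model output>, 'tests': <test file text>}, ...]"""
    passed = sum(run_solution(p["solution"], p["tests"]) for p in problems)
    return passed / len(problems)
```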
Related papers
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [106.11371409170818]
Large language models (LLMs) can act as agents that autonomously self-refine and improve generated code.
We propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process.
Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and then refine them.
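A minimal best-first reading of that idea is sketched below: root children correspond to distinct strategies, and nodes are refined and re-scored against execution feedback. `llm_generate`, the strategy prompts, and the test-based scorer are placeholders, not CodeTree's actual interfaces.

```python
# Hedged sketch of agent-guided tree search over candidate programs.
# `llm_generate` is a placeholder for an LLM call; `score` is assumed to
# return a test pass rate in [0, 1].
import heapq
from typing import Callable

def tree_search(task: str,
                llm_generate: Callable[[str], str],
                score: Callable[[str], float],
                strategies: list[str],
                max_expansions: int = 20,
                children_per_node: int = 2) -> str:
    # Root out-degree: one candidate solution per high-level strategy.
    frontier = []  # max-heap via negated scores
    for strat in strategies:
        code = llm_generate(f"Task: {task}\nStrategy: {strat}\nWrite the code.")
        heapq.heappush(frontier, (-score(code), code))
    best_score, best_code = -frontier[0][0], frontier[0][1]
    for _ in range(max_expansions):
        if not frontier or best_score >= 1.0:
            break
        _, code = heapq.heappop(frontier)
        for _ in range(children_per_node):  # refine the expanded node
            child = llm_generate(f"Task: {task}\nImprove this solution:\n{code}")
            s = score(child)
            if s > best_score:
                best_score, best_code = s, child
            heapq.heappush(frontier, (-s, child))
    return best_code
```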
arXiv Detail & Related papers (2024-11-07T00:09:54Z)
- CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
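A simplified sketch of the self-generation-and-validation idea follows: sample several candidate solutions and candidate tests for the same problem, rank solutions by how many generated tests they pass, and keep the best and worst as a preference pair. Plain pass counts stand in for the paper's iterative mutual-ranking scheme.

```python
# Sketch of building a code-preference pair from self-generated solutions
# and tests (simplified: plain pass counts, not CodeDPO's iterative ranking).
def passes(solution: str, test: str) -> bool:
    """Run one generated test against one candidate; sandboxing omitted."""
    try:
        env: dict = {}
        exec(solution, env)   # define the candidate's functions
        exec(test, env)       # the test is assumed to raise on failure
        return True
    except Exception:
        return False

def preference_pair(solutions: list[str], tests: list[str]) -> dict:
    scored = sorted(solutions, key=lambda s: sum(passes(s, t) for t in tests))
    return {"chosen": scored[-1], "rejected": scored[0]}
```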
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
- MapCoder: Multi-Agent Code Generation for Competitive Problem Solving [3.3856216159724983]
We introduce a new approach to code generation that leverages multi-agent prompting.
Our framework, MapCoder, consists of four LLM agents specifically designed to emulate the stages of program synthesis.
Our method consistently delivers superior performance across various programming languages.
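Read that way, the pipeline can be sketched as four sequential LLM calls that recall similar problems, plan, code, and debug; `llm` and `run_tests` below are generic stand-ins, not MapCoder's actual agent interfaces.

```python
# Schematic four-stage agent pipeline (retrieval -> planning -> coding ->
# debugging), with `llm` standing in for any chat-completion call.
from typing import Callable

def map_coder(problem: str, llm: Callable[[str], str],
              run_tests: Callable[[str], tuple[bool, str]],
              max_debug_rounds: int = 3) -> str:
    examples = llm(f"Recall similar problems and their solutions for:\n{problem}")
    plan = llm(f"Problem:\n{problem}\nRelevant examples:\n{examples}\n"
               "Write a step-by-step plan.")
    code = llm(f"Problem:\n{problem}\nPlan:\n{plan}\nImplement the plan in Python.")
    for _ in range(max_debug_rounds):
        ok, report = run_tests(code)
        if ok:
            break
        code = llm(f"The code below fails with:\n{report}\nFix it.\n{code}")
    return code
```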
arXiv Detail & Related papers (2024-05-18T22:10:15Z)
- Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation [0.0]
Large Language Models (LLMs) have become a popular choice for many Natural Language Processing (NLP) tasks.
LLMs' substantial computational and memory requirements often make them inaccessible to users with limited resources.
This paper focuses on very low-cost models which offer a more accessible alternative to resource-intensive LLMs.
arXiv Detail & Related papers (2024-04-17T08:16:48Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale freeform question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, yielding a series of novel and insightful findings.
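Freeform answers cannot be graded by unit tests alone; one model-free option used by benchmarks of this kind is keyword matching against reference points. The toy scorer below illustrates the idea and is not InfiBench's actual grader.

```python
# Toy keyword-matching scorer for freeform QA answers; an illustration of
# model-free scoring, not InfiBench's actual grading pipeline.
def keyword_score(answer: str, keywords: list[str]) -> float:
    """Fraction of reference keywords that appear in the answer."""
    text = answer.lower()
    hits = sum(kw.lower() in text for kw in keywords)
    return hits / len(keywords) if keywords else 0.0
```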
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
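One way to operationalize varying prompt specificity is to build tiers of the same task with progressively more detail, then compare pass rates across tiers. The tier design below is illustrative, not the paper's exact protocol.

```python
# Illustrative prompt tiers for one task, from terse to fully specified.
# The tier design is an assumption, not the paper's exact conditions.
def build_prompts(task: str, signature: str, example_io: str) -> dict[str, str]:
    return {
        "minimal":  task,
        "signed":   f"{task}\nImplement: {signature}",
        "detailed": (f"{task}\nImplement: {signature}\n"
                     f"Example:\n{example_io}\n"
                     "Handle edge cases and invalid input."),
    }

prompts = build_prompts(
    task="Write a function that returns the n-th Fibonacci number.",
    signature="def fib(n: int) -> int",
    example_io="fib(10) -> 55",
)
```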
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of more than 77,000 competitive C++ programming submission pairs, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
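A hedged sketch of the retrieval-based few-shot setup: embed the slow input program, retrieve the nearest (slow, fast) training pairs, and prepend them as demonstrations. The embedding function and similarity measure here are placeholders.

```python
# Sketch of retrieval-based few-shot prompting for program optimization:
# retrieve similar (slow, fast) pairs and use them as in-context examples.
# Embedding and similarity choices are placeholders, not the paper's setup.
from typing import Callable

def build_pie_prompt(slow_code: str,
                     corpus: list[tuple[str, str]],      # (slow, fast) pairs
                     embed: Callable[[str], list[float]],
                     k: int = 2) -> str:
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    query = embed(slow_code)
    nearest = sorted(corpus, key=lambda p: -dot(embed(p[0]), query))[:k]
    shots = "\n\n".join(
        f"# slower version:\n{s}\n# optimized version:\n{f}" for s, f in nearest
    )
    return f"{shots}\n\n# slower version:\n{slow_code}\n# optimized version:\n"
```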
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
- Fast Bayesian Optimization of Needle-in-a-Haystack Problems using Zooming Memory-Based Initialization [73.96101108943986]
A Needle-in-a-Haystack problem arises when there is an extreme imbalance of optimum conditions relative to the size of the dataset.
We present a Zooming Memory-Based Initialization algorithm that builds on conventional Bayesian optimization principles.
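The zooming idea can be paraphrased as: after a budget of evaluations, keep only the best "memory" points and restart the search on bounds shrunk around them. In the sketch below, random sampling stands in for the inner Bayesian-optimization loop, so only the zooming wrapper is illustrated.

```python
# Sketch of zooming memory-based initialization: repeatedly shrink the
# search bounds around the best points found so far. Random sampling
# stands in for the inner Bayesian-optimization acquisition loop.
import random
from typing import Callable, Sequence

def zooming_minimize(f: Callable[[Sequence[float]], float],
                     bounds: list[tuple[float, float]],
                     zooms: int = 3, evals_per_zoom: int = 50,
                     memory: int = 5) -> tuple[list[float], float]:
    history: list[tuple[float, list[float]]] = []
    for _ in range(zooms):
        for _ in range(evals_per_zoom):
            x = [random.uniform(lo, hi) for lo, hi in bounds]
            history.append((f(x), x))
        # Keep the `memory` best points and zoom the bounds onto their hull.
        best = sorted(history, key=lambda t: t[0])[:memory]
        for d, (lo, hi) in enumerate(bounds):
            coords = [x[d] for _, x in best]
            bounds[d] = (max(lo, min(coords)), min(hi, max(coords)))
        history = best
    best_val, best_x = min(history, key=lambda t: t[0])
    return best_x, best_val
```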
arXiv Detail & Related papers (2022-08-26T23:57:41Z)
- Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, ranging from problems with simple one-line solutions to substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
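Scoring of this kind amounts to feeding each test input to the candidate program and comparing outputs, with "strict accuracy" requiring every case of a problem to pass. The harness below sketches that metric for stdin/stdout problems; its details are assumptions, not the official APPS evaluator.

```python
# Sketch of APPS-style test-case scoring for stdin/stdout problems:
# feed each input to the candidate program and compare outputs.
import subprocess

def test_case_pass_rate(solution_path: str,
                        cases: list[tuple[str, str]],   # (stdin, expected stdout)
                        timeout: int = 4) -> float:
    passed = 0
    for stdin, expected in cases:
        try:
            out = subprocess.run(
                ["python", solution_path], input=stdin, text=True,
                capture_output=True, timeout=timeout,
            ).stdout
        except subprocess.TimeoutExpired:
            continue
        passed += out.strip() == expected.strip()
    return passed / len(cases)

def strict_accuracy(per_problem_rates: list[float]) -> float:
    """Fraction of problems where every test case passed."""
    return sum(r == 1.0 for r in per_problem_rates) / len(per_problem_rates)
```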
arXiv Detail & Related papers (2021-05-20T17:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.