Related papers: NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

URL: http://arxiv.org/abs/2401.15963v2
Date: Fri, 2 Feb 2024 18:11:27 GMT
Title: NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness
Authors: Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, Aditya Kanade
Abstract summary: Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements.
Score: 11.26732084588476
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code. In real-world software engineering, developers think beyond functional correctness. They have requirements on "how" a functionality should be implemented to meet overall system design objectives like efficiency, security, and maintainability. They would also trust the code LMs more if the LMs demonstrate robust understanding of requirements and code semantics. We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements. We propose a prompting method, Coding Concepts (CoCo), as a way for a developer to communicate the domain knowledge to the LMs. We conduct an extensive evaluation of twenty-two code LMs. Our finding is that they generally falter when tested on our benchmark, hinting at fundamental blind spots in their training setups. Surprisingly, even the classification accuracy on functional-correctness instances derived from the popular HumanEval benchmark is low, calling into question the depth of their comprehension and the source of their success in generating functionally-correct code in the first place. We will release our benchmark and evaluation scripts publicly at https://aka.ms/NoFunEval.

Related papers

How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [39.13045037676502]
Development of large language models (LLMs) has significantly pushed the frontiers of program synthesis. Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations. We develop ENAMEL, a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code.
arXiv Detail & Related papers (2024-06-10T04:19:20Z)
How Far Have We Gone in Stripped Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. Code reasoning is one of the most essential abilities of code LLMs. We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale freeform question-answering (QA) benchmark for code. It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages. We conduct a systematic evaluation of over 100 of the latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
Mercury: A Code Efficiency Benchmark for Code Large Language Models [41.51235610016959]
We present Mercury, the first code efficiency benchmark for Large Language Models for Code (Code LLMs). It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines. We introduce a new metric, Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously.
arXiv Detail & Related papers (2024-02-12T17:53:22Z)
Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator [115.16975276693267]
We propose Chain of Code, a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch.
arXiv Detail & Related papers (2023-12-07T17:51:43Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Test-Case-Driven Programming Understanding in Large Language Models for Better Code Generation [15.166827643436346]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs). It first exploits test case analysis to obtain specification understanding and enables a self-improvement process. muFiX further fixes the specification understanding in the direction that reduces the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z)
LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback. Our focus is the code generation task, where the model produces code based on natural language instructions. LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
CodeScore: Evaluating Code Generation by Learning Code Execution [34.08307174529496]
We propose CodeScore, a large language model (LLM)-based code evaluation metric (CEM), which estimates the functional correctness of generated code on three input formats. Compared to other CEMs, CodeScore improves correlation with functional correctness by up to 58.87% in absolute terms, achieves state-of-the-art performance, and effectively handles three input formats.
arXiv Detail & Related papers (2023-01-22T02:59:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.