NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional
Correctness
- URL: http://arxiv.org/abs/2401.15963v2
- Date: Fri, 2 Feb 2024 18:11:27 GMT
- Title: NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional
Correctness
- Authors: Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan,
Aditya Kanade
- Abstract summary: Existing evaluation benchmarks of language models of code (code LMs) focus almost exclusively on whether the LMs can generate functionally-correct code.
We propose a new benchmark NoFunEval to evaluate code LMs on non-functional requirements and simple classification instances for both functional and non-functional requirements.
- Score: 11.26732084588476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing evaluation benchmarks of language models of code (code LMs) focus
almost exclusively on whether the LMs can generate functionally-correct code.
In real-world software engineering, developers think beyond functional
correctness. They have requirements on "how" a functionality should be
implemented to meet overall system design objectives like efficiency, security,
and maintainability. They would also trust the code LMs more if the LMs
demonstrate robust understanding of requirements and code semantics.
We propose a new benchmark NoFunEval to evaluate code LMs on non-functional
requirements and simple classification instances for both functional and
non-functional requirements. We propose a prompting method, Coding Concepts
(CoCo), as a way for a developer to communicate the domain knowledge to the
LMs. We conduct an extensive evaluation of twenty-two code LMs. Our finding is
that they generally falter when tested on our benchmark, hinting at fundamental
blindspots in their training setups. Surprisingly, even the classification
accuracy on functional-correctness instances derived from the popular HumanEval
benchmark is low, calling into question the depth of their comprehension and the
source of their success in generating functionally-correct code in the first
place. We will release our benchmark and evaluation scripts publicly at
https://aka.ms/NoFunEval.
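To make the idea of the CoCo prompting method concrete, here is a minimal sketch of how a developer might state domain knowledge (coding concepts) explicitly alongside an editing task. The field names and template below are illustrative assumptions, not the paper's actual CoCo format.

```python
# Hypothetical sketch of a "Coding Concepts"-style prompt: the developer's
# domain knowledge is spelled out as explicit concepts next to the task,
# rather than left for the LM to infer. Template and field names are
# assumptions for illustration only.

def build_coco_prompt(source_code: str, instruction: str, concepts: list[str]) -> str:
    concept_lines = "\n".join(f"- {c}" for c in concepts)
    return (
        "You are editing the code below.\n"
        f"Task: {instruction}\n"
        "Relevant coding concepts to apply:\n"
        f"{concept_lines}\n"
        "Code:\n"
        f"{source_code}\n"
        "Return only the revised code."
    )

prompt = build_coco_prompt(
    source_code="password == user_input",
    instruction="Harden this comparison against timing attacks.",
    concepts=["Use a constant-time comparison (e.g. hmac.compare_digest) for secrets."],
)
print(prompt)
```

The point of such a prompt is that non-functional requirements (here, security) become explicit instructions rather than implicit expectations the LM is unlikely to satisfy on its own.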
Related papers
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the latest code generation LLMs in Russian. The benchmark includes 11 evaluation tasks spanning 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations on practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
- Evaluating Large Language Models on Non-Code Software Engineering Tasks [4.381476817430934]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation. We present the first comprehensive benchmark, which we name Software Engineering Language Understanding (SELU). SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) targets, with data drawn from diverse sources.
arXiv Detail & Related papers (2025-06-12T15:52:32Z)
- SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair [51.0686873716938]
We introduce SolBench, a benchmark for evaluating the functional correctness of Solidity smart contracts generated by code completion models.
We propose a Retrieval-Augmented Code Repair framework to verify functional correctness of smart contracts.
Results show that code repair and retrieval techniques effectively enhance the correctness of smart contract completion while reducing computational costs.
arXiv Detail & Related papers (2025-03-03T01:55:20Z)
- CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation [20.013757490442064]
We introduce CodeIF, the first benchmark designed to assess the abilities of Large Language Models (LLMs) to adhere to task-oriented instructions. CodeIF encompasses a broad range of tasks, including function synthesis, algorithmic instructions, and code explanation. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks.
arXiv Detail & Related papers (2025-02-26T14:19:49Z)
- How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark [39.13045037676502]
Development of large language models (LLMs) has significantly pushed the frontiers of program synthesis.
Most evaluation frameworks focus on the (functional) correctness of generated code; efficiency, as an important measure of code quality, has been overlooked in existing evaluations.
We develop ENAMEL, a rigorous and high-standard benchmark for evaluating the capability of LLMs in generating efficient code.
arXiv Detail & Related papers (2024-06-10T04:19:20Z)
- BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models [2.2863439039616127]
Probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training.
Previous approaches rely on the objective function used in pre-training LMs.
We propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement.
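The likelihood-based probing idea described above can be illustrated with a toy example: each candidate statement is scored by its (length-normalized) sum of per-token log-probabilities, and the highest-scoring candidate wins. The numbers below are made up for illustration; a real setup would obtain them from an actual causal or masked LM.

```python
# Toy sketch of likelihood-based probing: rank candidate statements by
# their average per-token log-probability under an LM. The log-prob
# values here are invented, not produced by a real model.

def statement_score(token_logprobs: list[float]) -> float:
    # Length-normalize so longer statements are not automatically penalized.
    return sum(token_logprobs) / len(token_logprobs)

candidates = {
    "Paris is the capital of France.":  [-0.2, -0.1, -0.3, -0.1, -0.2, -0.4, -0.1],
    "Berlin is the capital of France.": [-0.2, -2.9, -0.3, -0.1, -0.2, -0.4, -0.1],
}

# The correct statement has a much smaller log-prob penalty on its subject
# token, so it ranks first.
best = max(candidates, key=lambda s: statement_score(candidates[s]))
print(best)
```

The appeal of this approach, as the abstract notes, is that it works the same way for causal and masked LMs, since both can assign a likelihood to a full textual statement.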
arXiv Detail & Related papers (2024-04-05T14:13:55Z)
- Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- CodeMind: Evaluating Large Language Models for Code Reasoning [6.819757372634151]
Large Language Models (LLMs) have been widely used to automate programming tasks. This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-02-15T02:24:46Z)
- Mercury: A Code Efficiency Benchmark for Code Large Language Models [41.51235610016959]
We present Mercury, the first code efficiency benchmark for Large Language Models for Code (Code LLMs).
It comprises 1,889 Python tasks, each accompanied by adequate solutions that serve as real-world efficiency baselines.
We introduce a new metric Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and code efficiency simultaneously.
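A runtime-percentile-weighted pass score of the kind Beyond describes can be sketched as follows: a failing solution scores 0, and a passing solution scores the fraction of reference solutions it runs at least as fast as. This is a simplified illustration, not Mercury's exact formula; the baseline runtimes are hypothetical.

```python
# Illustrative sketch in the spirit of Mercury's "Beyond" metric: combine
# functional correctness and efficiency by weighting a pass with the
# solution's runtime percentile among baseline solutions. A simplification
# for illustration, not the benchmark's exact definition.

def weighted_pass(passed: bool, runtime: float, baseline_runtimes: list[float]) -> float:
    if not passed:
        return 0.0  # wrong output scores 0 regardless of speed
    # Fraction of baseline solutions this one runs at least as fast as.
    beaten = sum(1 for b in baseline_runtimes if runtime <= b)
    return beaten / len(baseline_runtimes)

baselines = [0.10, 0.25, 0.50, 1.00]  # seconds; hypothetical reference solutions
print(weighted_pass(True, 0.30, baselines))   # at least as fast as 2 of 4 -> 0.5
print(weighted_pass(False, 0.05, baselines))  # fails -> 0.0
```

The design point such a metric captures is that a plain pass rate treats a correct-but-slow solution the same as a correct-and-fast one, while the weighted score rewards both properties at once.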
arXiv Detail & Related papers (2024-02-12T17:53:22Z)
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- Chain of Code: Reasoning with a Language Model-Augmented Code Emulator [115.16975276693267]
We propose Chain of Code, a simple yet surprisingly effective extension that improves LM code-driven reasoning.
The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors and hand them off to the LM to simulate.
arXiv Detail & Related papers (2023-12-07T17:51:43Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback.
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- Large Language Models for Code: Security Hardening and Adversarial Testing [6.19238492410992]
Large language models (large LMs) are increasingly trained on massive codebases and used to generate code.
This work studies the security of LMs along two important axes: (i) security hardening, which aims to enhance LMs' reliability in generating secure code, and (ii) adversarial testing, which seeks to evaluate LMs' security from an adversarial standpoint.
arXiv Detail & Related papers (2023-02-10T15:28:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.