Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation
- URL: http://arxiv.org/abs/2510.12047v1
- Date: Tue, 14 Oct 2025 01:12:37 GMT
- Title: Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation
- Authors: Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han,
- Abstract summary: PACT is a program assessment and contract-adherence evaluation framework.<n>It provides a comprehensive test-suite corpus focused on contract violations.<n>It enables a systematic analysis of code generation under varied prompting conditions.
- Score: 11.445615378917578
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Prevailing code generation benchmarks, such as HumanEval+ and MBPP+, primarily evaluate large language models (LLMs) with pass@k on functional correctness using well-formed inputs. However, they ignore a crucial aspect of real-world software: adherence to contracts-the preconditions and validity constraints that dictate how ill-formed inputs must be rejected. This critical oversight means that existing benchmarks fail to measure, and models consequently fail to generate, truly robust and reliable code snippets. We introduce PACT, a program assessment and contract-adherence evaluation framework, to bridge this gap. PACT is the first framework designed to systematically evaluate and enhance contract-adherence in LLM-generated code snippets alongside functional correctness. PACT's contributions are threefold: First, it provides a comprehensive test-suite corpus focused on contract violations, extending HumanEval+ and MBPP+. Second, it enables a systematic analysis of code generation under varied prompting conditions. This analysis demonstrates that augmenting prompts with contract-violating test cases significantly enhance a model's ability to respect contracts compared to using contract description alone. Finally, it introduces novel metrics to rigorously quantify contract adherence in both test generation and code generation. By revealing critical errors that conventional benchmarks overlook, PACT provides the rigorous and interpretable metrics to evaluate the robustness of LLM-generated code snippets in both functionality and contract-adherence.Our code and data are available at https://github.com/suhanmen/PACT.
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable.<n>We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Beyond Code Similarity: Benchmarking the Plausibility, Efficiency, and Complexity of LLM-Generated Smart Contracts [3.3672086394822762]
LLMs produce code with high semantic similarity to real contracts.<n>Only 20% to 26% of zero-shot generations behave identically to ground-truth implementations under testing.<n>Retrieval-Augmented Generation markedly improves performance, boosting functional correctness by up to 45%.
arXiv Detail & Related papers (2025-11-20T10:47:59Z) - Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification? [1.9551668880584971]
Large Language Models (LLMs) have shown promise in inferring formal postconditions from natural language hints.<n>We introduce NL2Contract, the task of employing LLMs to translate informal natural language into formal functional contracts.<n>We evaluate NL2Contract with different LLMs and compare it to the task of postcondition generation nl2postcond.
arXiv Detail & Related papers (2025-10-14T16:37:39Z) - SolContractEval: A Benchmark for Evaluating Contract-Level Solidity Code Generation [17.116315125396696]
Solidity is the primary language for smart contracts.<n>Existing evaluations fall short of assessing models' capabilities in real-world contract development.<n>We introduce SolContractEval, the first contract-level benchmark for Solidity code generation.
arXiv Detail & Related papers (2025-09-28T11:53:41Z) - Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset [0.0]
We present ReDef, a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects.<n>Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks.<n>This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior existing resources.
arXiv Detail & Related papers (2025-09-11T07:07:11Z) - Alignment with Fill-In-the-Middle for Enhancing Code Generation [56.791415642365415]
We propose a novel approach that splits code snippets into smaller, granular blocks, creating more diverse DPO pairs from the same test cases.<n>Our approach demonstrates significant improvements in code generation tasks, as validated by experiments on benchmark datasets such as HumanEval (+), MBPP (+), APPS, LiveCodeBench, and BigCodeBench.
arXiv Detail & Related papers (2025-08-27T03:15:53Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - SolBench: A Dataset and Benchmark for Evaluating Functional Correctness in Solidity Code Completion and Repair [51.0686873716938]
We introduce SolBench, a benchmark for evaluating the functional correctness of Solidity smart contracts generated by code completion models.<n>We propose a Retrieval-Augmented Code Repair framework to verify functional correctness of smart contracts.<n>Results show that code repair and retrieval techniques effectively enhance the correctness of smart contract completion while reducing computational costs.
arXiv Detail & Related papers (2025-03-03T01:55:20Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - Code Comment Inconsistency Detection with BERT and Longformer [9.378041196272878]
Comments, or natural language descriptions of source code, are standard practice among software developers.
When the code is modified without an accompanying correction to the comment, an inconsistency between the comment and code can arise.
We propose two models to detect such inconsistencies in a natural language inference (NLI) context.
arXiv Detail & Related papers (2022-07-29T02:43:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.