Related papers: CIFE: Code Instruction-Following Evaluation

CIFE: Code Instruction-Following Evaluation

URL: http://arxiv.org/abs/2512.17387v1
Date: Fri, 19 Dec 2025 09:43:20 GMT
Title: CIFE: Code Instruction-Following Evaluation
Authors: Sravani Gunnu, Shanmukha Guttula, Hima Patel,
Abstract summary: We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories.<n>We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance.<n>Results reveal a substantial gap between partial and strict satisfaction, while strong models achieve over 90% partial adherence, strict adherence remains between 39-66%.
Score: 3.941243815951084
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction, while strong models achieve over 90% partial adherence, strict adherence remains between 39-66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.

Related papers

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models [5.733004743054914]
Large Reasoning Models (LRMs) exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process.<n>We introduce a formal framework for reasoning faithfulness, defined by two testable conditions.<n>We present RFEval, a benchmark of 7,186 instances that probes faithfulness via controlled, output-level counterfactual interventions.
arXiv Detail & Related papers (2026-02-19T03:49:37Z)
How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs [49.61011897610774]
How2Everything is a framework to evaluate and improve goal-conditioned procedure generation.<n>Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics.<n>How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
arXiv Detail & Related papers (2026-02-09T15:47:14Z)
Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents [0.7699235580548228]
LLM agents struggle with regulatory audit replay: when asked to reproduce a transaction flagged decision with identical inputs, most deployments fail to return consistent results.<n>This paper introduces the DeterminismFaithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism and evidence-conditioned faithfulness in tool-using agents deployed in financial services.
arXiv Detail & Related papers (2026-01-17T19:47:55Z)
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks.<n>We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency.<n>We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
SpecEval: Evaluating Model Adherence to Behavior Specifications [63.13000010340958]
We introduce an automated framework that audits models against their providers specifications.<n>Our central focus is on three way consistency between a provider specification, its model outputs, and its own models as judges.<n>We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies including compliance gaps of up to 20 percent across providers.
arXiv Detail & Related papers (2025-09-02T16:18:40Z)
OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question.<n>COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate.<n>We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness [10.220692937750295]
Reliable Score ( RSS) is a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean.<n>We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
arXiv Detail & Related papers (2025-06-06T09:37:45Z)
Conformal Arbitrage: Risk-Controlled Balancing of Competing Objectives in Language Models [5.294604210205507]
We introduce Conformal Arbitrage, a framework that learns a data driven threshold to mediate between a Primary model optimized for a primary objective and a more conservative Guardian.<n>We observe that our method outperforms, in terms of accuracy, cost matched random routing between models.
arXiv Detail & Related papers (2025-06-01T08:55:10Z)
Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.<n>We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z)
Probabilistic Consensus through Ensemble Validation: A Framework for LLM Reliability [0.0]
Large Language Models (LLMs) have shown significant advances in text generation but often lack the reliability needed for autonomous deployment. We introduce a novel framework that repurposes ensemble methods for content validation through model consensus. In tests across 78 complex cases requiring factual accuracy and causal consistency, our framework improved precision from 73.1% to 93.9%.
arXiv Detail & Related papers (2024-11-10T17:32:16Z)
ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory. We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm. Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
On the Limitations of Embedding Based Methods for Measuring Functional Correctness for Code Generation [4.065344017083881]
We analyze the ability of embedding-based metrics like CodeBERTScore to measure functional correctness and other helpful constructs like editing effort. Our results show that while they have a weak correlation with functional correctness (0.16), they are strongly correlated (0.72) with editing effort.
arXiv Detail & Related papers (2024-04-26T15:54:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.