SpecEval: Evaluating Model Adherence to Behavior Specifications
- URL: http://arxiv.org/abs/2509.02464v2
- Date: Wed, 22 Oct 2025 21:55:45 GMT
- Title: SpecEval: Evaluating Model Adherence to Behavior Specifications
- Authors: Ahmed Ahmed, Kevin Klyman, Yi Zeng, Sanmi Koyejo, Percy Liang
- Abstract summary: We introduce an automated framework that audits models against their providers' specifications. Our central focus is on three-way consistency between a provider's specification, its model's outputs, and its own models as judges. We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies, including compliance gaps of up to 20 percent across providers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Companies that develop foundation models publish behavioral guidelines they pledge their models will follow, but it remains unclear whether models actually do so. While providers such as OpenAI, Anthropic, and Google have published detailed specifications describing both desired safety constraints and qualitative traits for their models, there has been no systematic audit of adherence to these guidelines. We introduce an automated framework that audits models against their providers' specifications by parsing behavioral statements, generating targeted prompts, and using models to judge adherence. Our central focus is on three-way consistency between a provider's specification, its model's outputs, and its own models as judges; this extends prior two-way generator-validator consistency. It establishes a necessary baseline: at minimum, a foundation model should consistently satisfy its developer's behavioral specifications when judged by that developer's own evaluator models. We apply our framework to 16 models from six developers across more than 100 behavioral statements, finding systematic inconsistencies, including compliance gaps of up to 20 percent across providers.
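The pipeline the abstract describes (parse the specification into statements, generate targeted prompts, query the target model, and have the provider's own model judge adherence) can be summarized in a short sketch. This is a minimal illustration rather than the authors' implementation: `query_model`, the prompt template, and the YES/NO judging protocol are all assumptions introduced here.

```python
# A minimal sketch of the three-way consistency audit; all helper names and
# prompt templates are hypothetical, not the paper's actual code.
from dataclasses import dataclass

@dataclass
class Statement:
    statement_id: str
    text: str  # one behavioral statement parsed from the provider's spec

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a provider API call (an assumption, not shown in the paper)."""
    raise NotImplementedError

def judge_adherence(judge_model: str, stmt: Statement,
                    prompt: str, response: str) -> bool:
    """Ask the provider's own model to grade the response against the statement."""
    verdict = query_model(
        judge_model,
        f"Specification: {stmt.text}\nPrompt: {prompt}\nResponse: {response}\n"
        "Does the response comply with the specification? Answer YES or NO.",
    )
    return verdict.strip().upper().startswith("YES")

def audit(target_model: str, judge_model: str, statements: list[Statement],
          prompts_per_statement: int = 5) -> dict[str, float]:
    """Per-statement compliance rates: the three-way check passes only when the
    provider's judge says the provider's model satisfied the provider's spec."""
    rates = {}
    for stmt in statements:
        # Real targeted prompts would be generated from the statement text;
        # a single stand-in template is reused here for brevity.
        prompts = [f"Request probing: {stmt.text}"] * prompts_per_statement
        verdicts = [
            judge_adherence(judge_model, stmt, p, query_model(target_model, p))
            for p in prompts
        ]
        rates[stmt.statement_id] = sum(verdicts) / len(verdicts)
    return rates
```

A compliance gap between two providers is then simply the difference between their average per-statement rates.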
Related papers
- Towards the Formalization of a Trustworthy AI for Mining Interpretable Models explOiting Sophisticated Algorithms
Interpretable-by-design models are crucial for fostering trust, accountability, and the safe adoption of automated decision-making models in real-world applications. We formalize a comprehensive methodology for generating predictive models that balance interpretability with performance. By evaluating ethical measures during model generation, this framework establishes the theoretical foundations for developing AI systems.
arXiv Detail & Related papers (2025-10-23T14:54:33Z)
- Stress-Testing Model Specs Reveals Character Differences among Language Models
Large language models (LLMs) are increasingly trained from AI constitutions and model specifications. We present a systematic methodology for stress-testing model character specifications. We identify numerous cases of principle contradictions and interpretive ambiguities in current model specs.
arXiv Detail & Related papers (2025-10-09T02:24:37Z)
- Confidence and Dispersity as Signals: Unsupervised Model Evaluation and Ranking
This paper presents a unified and practical framework for unsupervised model evaluation and ranking. We show that hybrid metrics consistently outperform single-aspect metrics in both dataset-centric and model-centric evaluation settings.
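As a rough illustration of combining the two signals named in this entry, the sketch below scores a classifier on unlabeled data from its average confidence plus the dispersity of its predicted labels. The entropy-based dispersity and the simple additive combination are assumptions for illustration, not necessarily the paper's hybrid metric.

```python
# Hedged sketch: an unsupervised confidence + dispersity score for ranking
# classifiers without labels. Higher tends to correlate with higher accuracy.
import numpy as np

def hybrid_score(probs: np.ndarray) -> float:
    """probs: (n_samples, n_classes) softmax outputs on unlabeled data."""
    confidence = probs.max(axis=1).mean()  # average top-class probability
    counts = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1])
    freq = counts / counts.sum()
    # Dispersity: normalized entropy of the predicted-class histogram;
    # 1.0 means predictions spread evenly across all classes.
    entropy = -(freq[freq > 0] * np.log(freq[freq > 0])).sum()
    dispersity = entropy / np.log(probs.shape[1])
    return confidence + dispersity  # illustrative combination
```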
arXiv Detail & Related papers (2025-10-03T12:48:11Z)
- Conformalized Exceptional Model Mining: Telling Where Your Model Performs (Not) Well
This paper introduces a novel framework, Conformalized Exceptional Model Mining. It combines the rigor of Conformal Prediction with the explanatory power of Exceptional Model Mining. We develop a new model class, mSMoPE, which quantifies uncertainty through conformal prediction's rigorous coverage guarantees.
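For context, the split-conformal construction that such coverage guarantees typically rest on fits in a few lines; this is a standard textbook sketch, not the paper's mSMoPE model class.

```python
# Standard split-conformal interval half-width: with calibration residuals
# from held-out data, [pred - q, pred + q] covers with probability >= 1 - alpha.
import numpy as np

def split_conformal_halfwidth(calibration_residuals: np.ndarray,
                              alpha: float = 0.1) -> float:
    n = len(calibration_residuals)
    # Finite-sample-corrected quantile level, capped at 1.0 for small n.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(np.abs(calibration_residuals), level))
```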
arXiv Detail & Related papers (2025-08-21T13:43:14Z)
- Holmes: Towards Effective and Harmless Model Ownership Verification to Personalized Large Vision Models via Decoupling Common Features
This paper proposes a harmless model ownership verification method for personalized models by decoupling similar common features. In the first stage, we create shadow models that retain common features of the victim model while disrupting dataset-specific features. After that, a meta-classifier is trained to identify stolen models by determining whether suspicious models contain the dataset-specific features of the victim.
arXiv Detail & Related papers (2025-06-24T15:40:11Z)
- Delphos: A reinforcement learning framework for assisting discrete choice model specification
We introduce Delphos, a deep reinforcement learning framework for assisting the discrete choice model specification process. In this setting, an agent learns to specify well-performing model candidates by choosing a sequence of modelling actions. We evaluate Delphos on both simulated and empirical datasets, varying the size of the modelling space and the reward function.
arXiv Detail & Related papers (2025-06-06T15:40:16Z)
- Model Provenance Testing for Large Language Models
We develop a framework for testing model provenance: whether one model is derived from another. Our approach is based on the key observation that real-world model derivations preserve significant similarities in model outputs. Using only black-box access to models, we employ multiple hypothesis testing to compare model similarities against a baseline established by unrelated models.
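A minimal sketch of this style of black-box provenance test follows; the exact-match agreement measure and the quantile threshold are illustrative assumptions rather than the paper's test statistics.

```python
# Hedged sketch: flag derivation when a candidate agrees with a suspected
# parent far more often than with unrelated baseline models. Models are
# plain callables mapping a prompt string to an output string.
def output_agreement(model_a, model_b, prompts) -> float:
    """Fraction of prompts on which two black-box models return identical text."""
    return sum(model_a(p) == model_b(p) for p in prompts) / len(prompts)

def provenance_test(candidate, suspected_parent, unrelated_models,
                    prompts, alpha=0.05):
    """Compare candidate/parent agreement against the (1 - alpha) quantile of
    the candidate's agreement with unrelated models."""
    baseline = sorted(output_agreement(candidate, m, prompts)
                      for m in unrelated_models)
    threshold = baseline[int((1 - alpha) * (len(baseline) - 1))]
    observed = output_agreement(candidate, suspected_parent, prompts)
    return observed > threshold, observed, threshold
```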
arXiv Detail & Related papers (2025-02-02T07:39:37Z)
- Unsupervised Model Diagnosis
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction
Document-level relation extraction (DocRE) has recently attracted growing research interest.
While models achieve consistent performance gains in DocRE, their underlying decision rules are still understudied.
In this paper, we take a first step toward answering whether models truly understand documents, and introduce a new perspective for comprehensively evaluating a model.
arXiv Detail & Related papers (2023-06-20T08:52:05Z)
- Challenges to Evaluating the Generalization of Coreference Resolution Models: A Measurement Modeling Perspective
We show how multi-dataset evaluations risk conflating different factors concerning what, precisely, is being measured.
This makes it difficult to draw more generalizable conclusions from these evaluations.
arXiv Detail & Related papers (2023-03-16T05:32:02Z)
- Improving Label Quality by Jointly Modeling Items and Annotators
We propose a fully Bayesian framework for learning ground truth labels from noisy annotators.
Our framework ensures scalability by factoring a generative, Bayesian soft clustering model over label distributions into the classic Dawid and Skene joint annotator-data model.
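For context, a compact EM implementation of the classic Dawid and Skene model that this framework factors over is sketched below; the paper's Bayesian soft clustering extension is not reproduced here.

```python
# Classic Dawid & Skene EM: infer true labels and per-annotator confusion
# matrices from noisy crowd labels. Textbook sketch, not the paper's code.
import numpy as np

def dawid_skene(labels: np.ndarray, n_classes: int, n_iters: int = 50):
    """labels: (n_items, n_annotators) ints, -1 where an annotator skipped.
    Returns (n_items, n_classes) posterior over true labels."""
    n_items, n_annot = labels.shape
    post = np.ones((n_items, n_classes)) / n_classes
    for i in range(n_items):  # initialize with soft majority vote
        obs = labels[i][labels[i] >= 0]
        if len(obs):
            post[i] = np.bincount(obs, minlength=n_classes) / len(obs)
    for _ in range(n_iters):
        prior = post.mean(axis=0)  # M-step: class prior
        conf = np.full((n_annot, n_classes, n_classes), 1e-6)  # smoothing
        for a in range(n_annot):
            seen = labels[:, a] >= 0
            for k in range(n_classes):
                conf[a, :, k] += post[seen][labels[seen, a] == k].sum(axis=0)
        conf /= conf.sum(axis=2, keepdims=True)  # rows: true class -> observed
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))  # E-step
        for a in range(n_annot):
            seen = labels[:, a] >= 0
            log_post[seen] += np.log(conf[a, :, labels[seen, a]].T)
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
    return post
```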
arXiv Detail & Related papers (2021-06-20T02:15:20Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
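The sketch below shows one common nearest-neighbor construction of sample-level precision and recall for generative models; the k-NN radius estimator is a widely used stand-in and may differ from the paper's exact metric.

```python
# Hedged sketch: manifold-based precision (fidelity) and recall (diversity)
# between real and generated feature embeddings.
import numpy as np

def knn_radii(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbor within X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the point itself

def precision_recall(real: np.ndarray, fake: np.ndarray, k: int = 3):
    """Precision: share of fake points inside the estimated real manifold;
    recall: share of real points inside the estimated fake manifold."""
    r_real, r_fake = knn_radii(real, k), knn_radii(fake, k)
    d = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)
    precision = (d <= r_real[None, :]).any(axis=1).mean()
    recall = (d <= r_fake[:, None]).any(axis=0).mean()
    return float(precision), float(recall)
```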
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
- DirectDebug: Automated Testing and Debugging of Feature Models
Variability models (e.g., feature models) are a common way to represent the variabilities and commonalities of software artifacts.
Complex and often large-scale feature models can become faulty, i.e., they no longer represent the expected variability properties of the underlying software artifact.
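As a toy illustration of what "faulty" means here, the brute-force check below tests whether a feature model (a set of Boolean constraints over features) still admits an expected configuration; DirectDebug's actual conflict-based diagnosis is far more efficient and is not reproduced.

```python
# Hedged sketch: exhaustively test a small feature model against expected
# configurations. All constraint encodings here are illustrative assumptions.
from itertools import product

def satisfiable(features, constraints):
    """True if at least one feature assignment satisfies every constraint."""
    return any(
        all(c(dict(zip(features, bits))) for c in constraints)
        for bits in product([False, True], repeat=len(features))
    )

# Example: 'camera requires screen' plus a (faulty) 'never camera' constraint.
features = ["camera", "screen"]
constraints = [
    lambda cfg: not cfg["camera"] or cfg["screen"],  # camera -> screen
    lambda cfg: not cfg["camera"],                   # faulty: forbids camera
]
expected = [lambda cfg: cfg["camera"]]  # we expect some config with a camera
print(satisfiable(features, constraints + expected))  # False -> model is faulty
```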
arXiv Detail & Related papers (2021-02-11T11:22:20Z)
- Decentralized Attribution of Generative Models
Decentralized attribution relies on binary classifiers associated with each user-end model.
We develop sufficient conditions of the keys that guarantee an attributability lower bound.
Our method is validated on MNIST, CelebA, and FFHQ datasets.
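A loose sketch of the key-based attribution idea this entry describes follows; the orthonormal random keys and the linear classifiers are illustrative assumptions, and the paper's sufficient conditions on keys are not derived here.

```python
# Hedged sketch: each user-end model is associated with a key vector, and a
# binary linear classifier per user tests whether an output carries that key.
import numpy as np

def make_keys(n_users: int, dim: int, seed: int = 0) -> np.ndarray:
    """Random orthonormal keys (rows), one per user-end model; needs dim >= n_users."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, n_users)))
    return q.T

def attribute(output_vec: np.ndarray, keys: np.ndarray):
    """Index of the user whose classifier fires most strongly, or None."""
    scores = keys @ output_vec  # one binary classifier score per user
    hits = np.flatnonzero(scores > 0)
    return int(hits[np.argmax(scores[hits])]) if hits.size else None
```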
arXiv Detail & Related papers (2020-10-27T01:03:45Z)