Related papers: Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?

URL: http://arxiv.org/abs/2508.05464v2
Date: Fri, 08 Aug 2025 14:16:34 GMT
Title: Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Authors: Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
Abstract summary: Current AI evaluation practices depend heavily on established benchmarks. This research addresses the urgent need to quantify this "benchmark-regulation gap." Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities.
Score: 2.010294990327175
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities. On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability", while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This study provides the first comprehensive, quantitative analysis of this gap, demonstrating that current public benchmarks are insufficient, on their own, for providing the evidence of comprehensive risk assessment required for regulatory compliance and offering critical insights for the development of next-generation evaluation tools.
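The coverage figures quoted in the abstract can be read as per-category shares of each benchmark's regulatory-relevant questions, averaged across benchmarks. The following is a minimal sketch of that arithmetic, not the authors' implementation. It assumes the judge emits one taxonomy label per question and that "on average" means an unweighted mean over benchmarks. The function names, the shortened taxonomy list, and the toy data are all hypothetical.

```python
from collections import Counter

# Shortened, illustrative taxonomy. The category names are taken from the
# abstract; the real taxonomy is larger.
TAXONOMY = [
    "Tendency to hallucinate",
    "Lack of performance reliability",
    "Evading human oversight",
    "Self-replication",
]

def coverage_shares(labels):
    """Return each category's share of one benchmark's labelled questions."""
    counts = Counter(labels)
    total = sum(counts[c] for c in TAXONOMY)
    return {c: (counts[c] / total if total else 0.0) for c in TAXONOMY}

def mean_coverage(benchmarks):
    """Average the per-benchmark shares, weighting every benchmark equally."""
    shares = [coverage_shares(labels) for labels in benchmarks.values()]
    return {c: sum(s[c] for s in shares) / len(shares) for c in TAXONOMY}

def uncovered(benchmarks):
    """List taxonomy categories that no question in any benchmark maps to."""
    seen = {label for labels in benchmarks.values() for label in labels}
    return [c for c in TAXONOMY if c not in seen]

# Toy judge output (hypothetical): benchmark name -> one label per question.
toy = {
    "bench_a": ["Tendency to hallucinate"] * 3 + ["Lack of performance reliability"],
    "bench_b": ["Tendency to hallucinate", "Lack of performance reliability"],
}
avg = mean_coverage(toy)
gaps = uncovered(toy)
```

Weighting benchmarks equally keeps one very large benchmark from dominating the average. Pooling all questions before computing shares would be a different, equally plausible reading of the abstract.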

Related papers

Agentic AI for Commercial Insurance Underwriting with Adversarial Self-Critique [0.0]
This study presents a decision-negative, human-in-the-loop agentic system that incorporates an adversarial self-critique mechanism. Within this system, a critic agent challenges the primary agent's conclusions prior to submitting recommendations to human reviewers. The research develops a formal taxonomy of failure modes to characterize potential errors by decision-negative agents.
arXiv Detail & Related papers (2026-01-21T05:51:27Z)
BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search [72.87861928940929]
Boundary-Aware Policy Optimization (BAPO) is a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy. BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut.
arXiv Detail & Related papers (2026-01-16T07:06:58Z)
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems. We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z)
The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models [0.0]
We introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs). We present a bias benchmark for LLMs that measures performance across 29 distinct metrics. These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality-related misinformation risk.
arXiv Detail & Related papers (2025-03-31T16:56:52Z)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
Safety Evaluation of DeepSeek Models in Chinese Contexts [12.297396865203973]
This study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements.
arXiv Detail & Related papers (2025-02-16T14:05:54Z)
Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards [5.388550452190688]
This paper audits and quantifies security risks in three major AI governance standards: NIST AI RMF 1.0, UK's AI and Data Protection Risk Toolkit, and the EU's ALTAI. Using a novel risk assessment methodology, we develop four key metrics: Risk Severity Index (RSI), Attack Vector Potential Index (AVPI), Compliance-Security Gap Percentage (CSGP), and Root Cause Vulnerability Score (RCVS).
arXiv Detail & Related papers (2025-02-12T17:57:54Z)
Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation [2.2241228857601727]
This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices. It brings together many fine-grained issues in the design and application of benchmarks with broader sociotechnical issues. Our review also highlights a series of systemic flaws in current practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results.
arXiv Detail & Related papers (2025-02-10T15:25:06Z)
Bridging the Gap in XAI-Why Reliable Metrics Matter for Explainability and Compliance [2.3020018305241337]
The paper emphasizes the critical gap in the evaluation of Explainable AI (XAI) due to the lack of standardized and reliable metrics. Current evaluation methods are often fragmented, subjective, and biased, making them prone to manipulation and complicating the assessment of complex models. We advocate for widespread research into developing robust, context-sensitive evaluation metrics.
arXiv Detail & Related papers (2025-02-07T06:54:48Z)
SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
Peer-induced Fairness: A Causal Approach for Algorithmic Fairness Auditing [0.0]
The European Union's Artificial Intelligence Act takes effect on 1 August 2024. High-risk AI applications must adhere to stringent transparency and fairness standards. We propose a novel framework, which combines the strengths of counterfactual fairness and peer comparison strategy.
arXiv Detail & Related papers (2024-08-05T15:35:34Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values. We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.