Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
- URL: http://arxiv.org/abs/2508.05464v1
- Date: Thu, 07 Aug 2025 15:03:39 GMT
- Title: Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
- Authors: Matteo Prandi, Vincenzo Suriani, Federico Pierucci, Marcello Galisai, Daniele Nardi, Piercosma Bisconti
- Abstract summary: Current AI evaluation practices depend heavily on established benchmarks. These tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap".
- Score: 2.010294990327175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of General Purpose AI (GPAI) models necessitates robust evaluation frameworks, especially with emerging regulations like the EU AI Act and its associated Code of Practice (CoP). Current AI evaluation practices depend heavily on established benchmarks, but these tools were not designed to measure the systemic risks that are the focus of the new regulatory landscape. This research addresses the urgent need to quantify this "benchmark-regulation gap." We introduce Bench-2-CoP, a novel, systematic framework that uses validated LLM-as-judge analysis to map the coverage of 194,955 questions from widely-used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. Our findings reveal a profound misalignment: the evaluation ecosystem is overwhelmingly focused on a narrow set of behavioral propensities, such as "Tendency to hallucinate" (53.7% of the corpus) and "Discriminatory bias" (28.9%), while critical functional capabilities are dangerously neglected. Crucially, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus. This translates to a near-total evaluation gap for systemic risks like "Loss of Control" (0.4% coverage) and "Cyber Offence" (0.8% coverage). This study provides the first comprehensive, quantitative analysis of this gap, offering critical insights for policymakers to refine the CoP and for developers to build the next generation of evaluation tools, ultimately fostering safer and more compliant AI.
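As a rough illustration of the pipeline the abstract describes, the minimal Python sketch below maps benchmark questions onto taxonomy categories via a pluggable judge function and reports per-category coverage as a percentage of the corpus. The taxonomy labels, function names, and the keyword-based stand-in judge are illustrative assumptions, not the authors' implementation; the paper uses a validated LLM-as-judge over the EU AI Act / CoP taxonomy of capabilities and propensities.

```python
from collections import Counter
from typing import Callable, Iterable

# Hypothetical subset of the EU AI Act / CoP taxonomy used for illustration only.
TAXONOMY = [
    "Tendency to hallucinate",
    "Discriminatory bias",
    "Evading human oversight",
    "Self-replication",
    "Autonomous AI development",
    "Cyber offence",
]


def coverage_report(
    questions: Iterable[str],
    judge: Callable[[str], list[str]],
) -> dict[str, float]:
    """Map each benchmark question to zero or more taxonomy categories
    using the given judge callable, then return per-category coverage
    as a percentage of the corpus."""
    counts = Counter()
    total = 0
    for question in questions:
        total += 1
        for category in judge(question):
            if category in TAXONOMY:
                counts[category] += 1
    return {c: (100.0 * counts[c] / total if total else 0.0) for c in TAXONOMY}


# Trivial keyword-based stand-in for the LLM-as-judge (assumption, for demo only).
def keyword_judge(question: str) -> list[str]:
    q = question.lower()
    hits = []
    if "hallucinat" in q or "factual" in q:
        hits.append("Tendency to hallucinate")
    if "bias" in q or "stereotype" in q:
        hits.append("Discriminatory bias")
    return hits


if __name__ == "__main__":
    corpus = [
        "Does the model hallucinate citations when asked for sources?",
        "Identify stereotypes in the following job advertisement.",
        "Write a short poem about the sea.",
    ]
    for category, pct in coverage_report(corpus, keyword_judge).items():
        print(f"{category}: {pct:.1f}% of corpus")
```

In the paper's setting, the judge would wrap a validated LLM-as-judge prompt rather than keyword matching, and the input corpus would be the 194,955 questions drawn from widely-used benchmarks; categories with near-zero percentages would then surface the evaluation gaps the abstract reports.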
Related papers
- Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems. We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - Safety Evaluation of DeepSeek Models in Chinese Contexts [12.297396865203973]
This study introduces CHiSafetyBench, a Chinese-specific safety evaluation benchmark. This benchmark systematically evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts. The experimental results quantify the deficiencies of these two models in Chinese contexts, providing key insights for subsequent improvements.
arXiv Detail & Related papers (2025-02-16T14:05:54Z) - Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards [5.388550452190688]
This paper audits and quantifies security risks in three major AI governance standards: NIST AI RMF 1.0, UK's AI and Data Protection Risk Toolkit, and the EU's ALTAI. Using a novel risk assessment methodology, we develop four key metrics: Risk Severity Index (RSI), Attack Potential Index (AVPI), Compliance-Security Gap Percentage (CSGP), and Root Cause Vulnerability Score (RCVS).
arXiv Detail & Related papers (2025-02-12T17:57:54Z) - Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation [2.2241228857601727]
This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices. It brings together many fine-grained issues in the design and application of benchmarks with broader sociotechnical issues. Our review also highlights a series of systemic flaws in current practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results.
arXiv Detail & Related papers (2025-02-10T15:25:06Z) - Peer-induced Fairness: A Causal Approach for Algorithmic Fairness Auditing [0.0]
The European Union's Artificial Intelligence Act takes effect on 1 August 2024.
High-risk AI applications must adhere to stringent transparency and fairness standards.
We propose a novel framework that combines the strengths of counterfactual fairness and a peer comparison strategy.
arXiv Detail & Related papers (2024-08-05T15:35:34Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)