Related papers: AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance

AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance

URL: http://arxiv.org/abs/2510.01474v2
Date: Mon, 13 Oct 2025 01:10:53 GMT
Title: AIReg-Bench: Benchmarking Language Models That Assess AI Regulation Compliance
Authors: Bill Marino, Rosco Hunter, Zubair Jamali, Marinos Emmanouil Kalpakos, Mudra Kashyap, Isaiah Hinton, Alexa Hanson, Maahum Nazir, Christoph Schnabl, Felix Steffek, Hongkai Wen, Nicholas D. Lane,
Abstract summary: There is growing interest in using Large Language Models (LLMs) to assess whether an AI system complies with a given AI Regulation (AIR)<n>We introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA)
Score: 10.49637840194233
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts' compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.

Related papers

RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis [78.32151470154422]
We introduce RAVEL, an agentic framework that enables the testers to autonomously plan and execute typical synthesis operations.<n>We present C3EBench, a benchmark comprising 1,258 samples derived from professional human writings.<n>By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability.
arXiv Detail & Related papers (2026-02-28T14:47:34Z)
Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams [0.0]
We evaluate ten leading open and closed large language models (LLMs)<n>Our findings show that several frontier models consistently achieve scores exceeding the standards for professional human certification.<n>This paper provides an overview for professionals navigating the intersection of AI advancement and regulatory risk.
arXiv Detail & Related papers (2025-08-12T15:57:22Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
MoRE-LLM: Mixture of Rule Experts Guided by a Large Language Model [54.14155564592936]
We propose a Mixture of Rule Experts guided by a Large Language Model (MoRE-LLM)<n>MoRE-LLM steers the discovery of local rule-based surrogates during training and their utilization for the classification task.<n>LLM is responsible for enhancing the domain knowledge alignment of the rules by correcting and contextualizing them.
arXiv Detail & Related papers (2025-03-26T11:09:21Z)
Can We Trust AI Agents? A Case Study of an LLM-Based Multi-Agent System for Ethical AI [10.084913433923566]
AI-based systems impact millions by supporting diverse tasks but face issues like misinformation, bias, and misuse.<n>This study examines the use of Large Language Models (LLM) for AI ethics in practice.<n>We design a prototype, where agents engage in structured discussions on real-world AI ethics issues from the AI Incident Database.
arXiv Detail & Related papers (2024-10-25T20:17:59Z)
Training of Scaffolded Language Models with Language Supervision: A Survey [62.59629932720519]
This survey organizes the literature on the design and optimization of emerging structures around post-trained LMs.<n>We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools.
arXiv Detail & Related papers (2024-10-21T18:06:25Z)
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation [63.611024451010316]
Following natural instructions is crucial for the effective application of Retrieval-Augmented Generation (RAG) systems. We propose VIF-RAG, the first automated, scalable, and verifiable synthetic pipeline for instruction-following alignment in RAG systems.
arXiv Detail & Related papers (2024-10-12T16:30:51Z)
COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act [40.233017376716305]
The EU's Artificial Intelligence Act (AI Act) is a significant step towards responsible AI development.<n>It lacks clear technical interpretation, making it difficult to assess models' compliance.<n>This work presents COMPL-AI, a comprehensive framework consisting of the first technical interpretation of the Act.
arXiv Detail & Related papers (2024-10-10T14:23:51Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.<n>With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.<n> Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
A Preliminary Study on Using Large Language Models in Software Pentesting [2.0551676463612636]
Large language models (LLM) are perceived to offer promising potentials for automating security tasks. We investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code.
arXiv Detail & Related papers (2024-01-30T21:42:59Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
Counter Turing Test CT^2: AI-Generated Text Detection is Not as Easy as You May Think -- Introducing AI Detectability Index [9.348082057533325]
AI-generated text detection (AGTD) has emerged as a topic that has already received immediate attention in research. This paper introduces the Counter Turing Test (CT2), a benchmark consisting of techniques aiming to offer a comprehensive evaluation of the fragility of existing AGTD techniques.
arXiv Detail & Related papers (2023-10-08T06:20:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.