Fugu-MT Paper Translation (Abstract): AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems

Paper abstract: AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems

arxiv url: http://arxiv.org/abs/2603.09435v1
Date: Tue, 10 Mar 2026 09:47:50 GMT
Status: Translation complete
Last updated in system: 2026-03-11 15:25:24.213907
Title: AI Act Evaluation Benchmark: An Open, Transparent, and Reproducible Evaluation Dataset for NLP and RAG Systems
Authors: Athanasios Davvetas, Michael Papademas, Xenia Ziouvelou, Vangelis Karkaletsis
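Abstract summary: The rapid rollout of AI across heterogeneous public and societal sectors has increased the need for compliance with regulatory standards and frameworks. Developing solutions that elicit the level of AI systems' compliance with such standards is often limited by a lack of resources. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models.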
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid rollout of AI in heterogeneous public and societal sectors has subsequently escalated the need for compliance with regulatory standards and frameworks. The EU AI Act has emerged as a landmark in the regulatory landscape. The development of solutions that elicit the level of AI systems' compliance with such standards is often limited by the lack of resources, hindering the semi-automated or automated evaluation of their performance. This generates the need for manual work, which is often error-prone, resource-limited or limited to cases not clearly described by the regulation. This paper presents an open, transparent, and reproducible method of creating a resource that facilitates the evaluation of NLP models with a strong focus on RAG systems. We have developed a dataset that contains the tasks of risk-level classification, article retrieval, obligation generation, and question-answering for the EU AI Act. The dataset files are in a machine-to-machine appropriate format. To generate the files, we utilise domain knowledge as an exegetical basis, combined with the processing and reasoning power of large language models, to generate scenarios along with the respective tasks. Our methodology demonstrates a way to harness language models for grounded generation with high document relevancy. Besides, we overcome limitations such as navigating the decision boundaries of risk-levels that are not explicitly defined within the EU AI Act, such as limited and minimal cases. Finally, we demonstrate our dataset's effectiveness by evaluating a RAG-based solution that reaches 0.87 and 0.85 F1-score for prohibited and high-risk scenarios.
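The abstract reports per-class F1-scores for the risk-level classification task. As a rough illustration of how such a score is computed, the sketch below derives per-class F1 from gold and predicted labels. The label names, the function `per_class_f1`, and the toy data are assumptions made for this example; they are not taken from the paper's dataset or evaluation code.

```python
from collections import Counter

# Assumed label set: the four risk levels named in the abstract.
# The exact label strings used in the dataset files are not known here.
RISK_LEVELS = ["prohibited", "high", "limited", "minimal"]


def per_class_f1(gold, predicted, labels=RISK_LEVELS):
    """Return {label: F1} for single-label classification."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy data invented for illustration only.
gold = ["prohibited", "high", "high", "limited", "minimal", "prohibited"]
pred = ["prohibited", "high", "limited", "limited", "minimal", "high"]
scores = per_class_f1(gold, pred)
for label in RISK_LEVELS:
    print(f"{label}: {scores[label]:.2f}")
```

Reporting F1 per class, rather than a single averaged figure, matches how the abstract states separate scores for prohibited and high-risk scenarios.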


Related papers