Fugu-MT Paper Translation (Abstract): RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

Paper abstract: RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

arxiv url: http://arxiv.org/abs/2602.02280v1
Date: Mon, 02 Feb 2026 16:20:51 GMT
Status: Translation complete
System update date: 2026-02-03 19:28:34.284551
Title: RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Title (reference translation): RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Authors: Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun
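Abstract summary: This paper introduces RACA, a novel set of coverage criteria designed specifically for LLM safety testing. The authors conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization. They also demonstrate its practical application in real-world scenarios such as test set prioritization and attack prompt sampling.
Reference score (independently computed attention score): 13.729870450773797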
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization, where the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and is superior to traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling. Furthermore, our findings confirm RACA's generalization to various scenarios and its robustness across various configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of testing for AI.
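Abstract (reference translation): Recent advances in LLMs have brought major breakthroughs to a variety of AI applications. However, their advanced capabilities also raise serious safety concerns, in particular the generation of harmful content through jailbreak attacks. Current safety testing of LLMs relies on static datasets and lacks systematic criteria for evaluating the quality and adequacy of those tests. Coverage criteria are effective for smaller neural networks, but they cannot be applied directly to LLMs because of scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria designed specifically for LLM safety testing. RACA uses representation engineering to focus on safety-critical concepts inside LLMs, reducing dimensionality and filtering out irrelevant information. The framework operates in three stages. First, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Next, it computes conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization, and show that RACA identifies high-quality jailbreak prompts and outperforms traditional neuron-level criteria. We also demonstrate practical applications in real-world scenarios such as test set prioritization and attack prompt sampling. In addition, we confirm RACA's generalization to a range of scenarios and its robustness across configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs and contributes a valuable technique to the field of AI testing.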
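
The three stages named in the abstract can be illustrated with a toy sketch. The abstract does not give RACA's formulas or its six sub-criteria, so everything below is an assumption made for illustration: a difference-of-means direction stands in for a "safety-critical representation", a dot-product projection stands in for the "conceptual activation score", and a simple equal-width section count stands in for a single-concept coverage criterion. The function names (concept_direction, activation_score, section_coverage) and the tiny 2-dimensional "hidden states" are hypothetical; in practice the vectors would be hidden states extracted from an LLM.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(harmful_acts, benign_acts):
    """Stage 1 (assumed): unit difference-of-means direction between
    activations of calibration jailbreak prompts and benign prompts."""
    mu_h = mean_vector(harmful_acts)
    mu_b = mean_vector(benign_acts)
    diff = [h - b for h, b in zip(mu_h, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def activation_score(act, direction):
    """Stage 2 (assumed): projection of one hidden state onto the direction."""
    return sum(a * d for a, d in zip(act, direction))


def section_coverage(scores, low, high, k):
    """Stage 3 (assumed): fraction of k equal-width sections of [low, high]
    that contain at least one score from the test suite."""
    width = (high - low) / k
    hit = set()
    for s in scores:
        if low <= s <= high:
            # Clamp so that s == high falls into the last section.
            hit.add(min(int((s - low) / width), k - 1))
    return len(hit) / k


# Toy calibration activations (hypothetical 2-dimensional hidden states).
harmful_acts = [[2.0, 0.0], [4.0, 0.0]]
benign_acts = [[0.0, 0.0], [2.0, 0.0]]
direction = concept_direction(harmful_acts, benign_acts)

# Toy test-suite activations; only the first coordinate lies along the concept.
test_acts = [[0.5, 9.0], [1.5, -3.0], [3.5, 1.0]]
scores = [activation_score(a, direction) for a in test_acts]

# Score range and section count are illustrative choices, not values from the paper.
coverage = section_coverage(scores, low=0.0, high=4.0, k=4)
print(direction, scores, coverage)
```

In this toy run the three test prompts land in three of the four score sections, so the sketch reports a coverage of 0.75. A test suite whose scores all fell into one section would score 0.25, which is the intuition behind using coverage to compare or prioritize test sets. Compositional criteria over several concepts are not sketched here.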

Related papers