Related papers: RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

URL: http://arxiv.org/abs/2602.02280v1
Date: Mon, 02 Feb 2026 16:20:51 GMT
Title: RACA: Representation-Aware Coverage Criteria for LLM Safety Testing
Authors: Zeming Wei, Zhixin Zhang, Chengcan Wu, Yihao Zhang, Xiaokun Luan, Meng Sun,
Abstract summary: This paper introduces RACA, a novel set of coverage criteria specifically designed for AI safety testing.<n>We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization.<n>We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling.
Score: 13.729870450773797
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in LLMs have led to significant breakthroughs in various AI applications. However, their sophisticated capabilities also introduce severe safety concerns, particularly the generation of harmful content through jailbreak attacks. Current safety testing for LLMs often relies on static datasets and lacks systematic criteria to evaluate the quality and adequacy of these tests. While coverage criteria have been effective for smaller neural networks, they are not directly applicable to LLMs due to scalability issues and differing objectives. To address these challenges, this paper introduces RACA, a novel set of coverage criteria specifically designed for LLM safety testing. RACA leverages representation engineering to focus on safety-critical concepts within LLMs, thereby reducing dimensionality and filtering out irrelevant information. The framework operates in three stages: first, it identifies safety-critical representations using a small, expert-curated calibration set of jailbreak prompts. Second, it calculates conceptual activation scores for a given test suite based on these representations. Finally, it computes coverage results using six sub-criteria that assess both individual and compositional safety concepts. We conduct comprehensive experiments to validate RACA's effectiveness, applicability, and generalization, where the results demonstrate that RACA successfully identifies high-quality jailbreak prompts and is superior to traditional neuron-level criteria. We also showcase its practical application in real-world scenarios, such as test set prioritization and attack prompt sampling. Furthermore, our findings confirm RACA's generalization to various scenarios and its robustness across various configurations. Overall, RACA provides a new framework for evaluating the safety of LLMs, contributing a valuable technique to the field of testing for AI.

Related papers

Is Your Prompt Poisoning Code? Defect Induction Rates and Security Mitigation Strategies [4.435429537888066]
Large language models (LLMs) have become indispensable for automated code generation, yet the quality and security of their outputs remain a critical concern.<n>We propose an evaluation framework for prompt quality encompassing three key dimensions: goal clarity, information completeness, and logical consistency.<n>Our findings highlight that enhancing the quality of user prompts constitutes a critical and effective strategy for strengthening the security of AI-generated code.
arXiv Detail & Related papers (2025-10-27T02:59:17Z)
ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications.<n>We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z)
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance.<n>We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise.<n>We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably.<n>This poses a significant challenge to ensuring their safe deployment.<n>We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z)
Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks [12.893445918647842]
Large Language Models (LLMs) demonstrate impressive capabilities across various fields, yet their increasing use raises critical security concerns. This article reviews recent literature addressing key issues in LLM security, with a focus on accuracy, bias, content detection, and vulnerability to attacks.
arXiv Detail & Related papers (2024-09-12T14:42:08Z)
Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks [10.909463767558023]
Large language models (LLMs) have revolutionized artificial intelligence, but their deployment across critical domains has raised concerns about their abnormal behaviors when faced with malicious attacks.<n>In this paper, we conduct a comprehensive empirical study to evaluate the effectiveness of traditional coverage criteria in identifying such inadequacies.<n>We develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak.
arXiv Detail & Related papers (2024-08-27T17:14:21Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.