ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models
- URL: http://arxiv.org/abs/2310.09624v2
- Date: Sat, 11 Nov 2023 05:30:34 GMT
- Title: ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models
- Authors: Alex Mei, Sharon Levy, William Yang Wang
- Abstract summary: ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
- Score: 65.79770974145983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models are integrated into society, robustness toward a
suite of prompts is increasingly important to maintain reliability in a
high-variance environment.Robustness evaluations must comprehensively
encapsulate the various settings in which a user may invoke an intelligent
system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming,
consisting of three methods -- semantically aligned augmentation, target
bootstrapping, and adversarial knowledge injection. For robust safety
evaluation, we apply these methods in the critical domain of AI safety to
algorithmically generate a test suite of prompts covering diverse robustness
settings -- semantic equivalence, related scenarios, and adversarial. We
partition our prompts into four safety domains for a fine-grained analysis of
how the domain affects model performance. Despite dedicated safeguards in
existing state-of-the-art models, we find statistically significant performance
differences of up to 11% in absolute classification accuracy among semantically
related scenarios and error rates of up to 19% absolute error in zero-shot
adversarial settings, raising concerns for users' physical safety.
Related papers
- Towards Million-Scale Adversarial Robustness Evaluation With Stronger Individual Attacks [26.422616504640786]
We propose a novel individual attack method, Probability Margin Attack (PMA), which defines the adversarial margin in the probability space rather than the logits space.
We create a million-scale dataset, CC1M, and use it to conduct the first million-scale adversarial robustness evaluation of adversarially-trained ImageNet models.
arXiv Detail & Related papers (2024-11-20T10:41:23Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Towards Precise Observations of Neural Model Robustness in Classification [2.127049691404299]
In deep learning applications, robustness measures the ability of neural models that handle slight changes in input data.
Our approach contributes to a deeper understanding of model robustness in safety-critical applications.
arXiv Detail & Related papers (2024-04-25T09:37:44Z) - Dynamic Vulnerability Criticality Calculator for Industrial Control Systems [0.0]
This paper introduces an innovative approach by proposing a dynamic vulnerability criticality calculator.
Our methodology encompasses the analysis of environmental topology and the effectiveness of deployed security mechanisms.
Our approach integrates these factors into a comprehensive Fuzzy Cognitive Map model, incorporating attack paths to holistically assess the overall vulnerability score.
arXiv Detail & Related papers (2024-03-20T09:48:47Z) - Unveiling Safety Vulnerabilities of Large Language Models [4.562678399685183]
This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ.
We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it.
We introduce a novel automatic approach for identifying and naming vulnerable semantic regions.
arXiv Detail & Related papers (2023-11-07T16:50:33Z) - From Adversarial Arms Race to Model-centric Evaluation: Motivating a
Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantic-preserved but misleading perturbations to the inputs.
The existing practice of robustness evaluation may exhibit issues of incomprehensive evaluation, impractical evaluation protocol, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z) - Distributional Instance Segmentation: Modeling Uncertainty and High
Confidence Predictions with Latent-MaskRCNN [77.0623472106488]
In this paper, we explore a class of distributional instance segmentation models using latent codes.
For robotic picking applications, we propose a confidence mask method to achieve the high precision necessary.
We show that our method can significantly reduce critical errors in robotic systems, including our newly released dataset of ambiguous scenes.
arXiv Detail & Related papers (2023-05-03T05:57:29Z) - Increasing the Confidence of Deep Neural Networks by Coverage Analysis [71.57324258813674]
This paper presents a lightweight monitoring architecture based on coverage paradigms to enhance the model against different unsafe inputs.
Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs.
arXiv Detail & Related papers (2021-01-28T16:38:26Z) - Multimodal Safety-Critical Scenarios Generation for Decision-Making
Algorithms Evaluation [23.43175124406634]
Existing neural network-based autonomous systems are shown to be vulnerable against adversarial attacks.
We propose a flow-based multimodal safety-critical scenario generator for evaluating decisionmaking algorithms.
We evaluate six Reinforcement Learning algorithms with our generated traffic scenarios and provide empirical conclusions about their robustness.
arXiv Detail & Related papers (2020-09-16T15:16:43Z) - A general framework for defining and optimizing robustness [74.67016173858497]
We propose a rigorous and flexible framework for defining different types of robustness properties for classifiers.
Our concept is based on postulates that robustness of a classifier should be considered as a property that is independent of accuracy.
We develop a very general robustness framework that is applicable to any type of classification model.
arXiv Detail & Related papers (2020-06-19T13:24:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.