ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models
- URL: http://arxiv.org/abs/2310.09624v2
- Date: Sat, 11 Nov 2023 05:30:34 GMT
- Title: ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models
- Authors: Alex Mei, Sharon Levy, William Yang Wang
- Abstract summary: ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models are integrated into society, robustness toward a
suite of prompts is increasingly important to maintain reliability in a
high-variance environment. Robustness evaluations must comprehensively
encapsulate the various settings in which a user may invoke an intelligent
system. This paper proposes ASSERT, Automated Safety Scenario Red Teaming,
consisting of three methods -- semantically aligned augmentation, target
bootstrapping, and adversarial knowledge injection. For robust safety
evaluation, we apply these methods in the critical domain of AI safety to
algorithmically generate a test suite of prompts covering diverse robustness
settings -- semantic equivalence, related scenarios, and adversarial. We
partition our prompts into four safety domains for a fine-grained analysis of
how the domain affects model performance. Despite dedicated safeguards in
existing state-of-the-art models, we find statistically significant performance
differences of up to 11% in absolute classification accuracy among semantically
related scenarios and error rates of up to 19% absolute error in zero-shot
adversarial settings, raising concerns for users' physical safety.
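
The abstract only names the three prompt-generation methods; as a rough illustration, the sketch below shows one way the first of them, semantically aligned augmentation, could be approximated in Python: paraphrase a seed safety scenario into meaning-preserving variants, query the model under test on each, and measure how consistently it reproduces the expected safety judgment. The `generate` and `classify` callables are placeholders for arbitrary LLM calls; none of this is the authors' released code.

```python
# Hypothetical sketch of semantically aligned augmentation for safety
# robustness evaluation. Not the ASSERT implementation.

from typing import Callable, List


def augment_semantically(seed_prompt: str,
                         generate: Callable[[str], str],
                         n_variants: int = 5) -> List[str]:
    """Ask an LLM to rewrite the seed scenario while preserving its meaning."""
    variants = []
    for i in range(n_variants):
        instruction = (
            "Rewrite the following scenario so it keeps the same meaning "
            f"but uses different wording (variant {i + 1}):\n{seed_prompt}"
        )
        variants.append(generate(instruction).strip())
    return variants


def consistency_rate(seed_prompt: str,
                     expected_label: str,
                     classify: Callable[[str], str],
                     generate: Callable[[str], str]) -> float:
    """Fraction of the seed plus its paraphrases that the model under test
    still labels with the expected safe/unsafe verdict."""
    prompts = [seed_prompt] + augment_semantically(seed_prompt, generate)
    correct = sum(classify(p) == expected_label for p in prompts)
    return correct / len(prompts)
```

A drop in consistency across such paraphrases would correspond to the kind of accuracy gap among semantically related scenarios that the paper reports.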