Related papers: SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior

URL: http://arxiv.org/abs/2410.16665v2
Date: Fri, 31 Jan 2025 18:01:12 GMT
Title: SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior
Authors: Jing-Jing Li, Valentina Pyatkin, Max Kleiman-Weiner, Liwei Jiang, Nouha Dziri, Anne G. E. Collins, Jana Schaich Borg, Maarten Sap, Yejin Choi, Sydney Levine,
Abstract summary: We present SafetyAnalyst, a novel AI safety moderation framework.<n>Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.<n>It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters.
Score: 56.10557932893919
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The ideal AI safety moderation system would be both structurally interpretable (so its decisions can be reliably explained) and steerable (to align to safety standards and reflect a community's values), which current systems fall short on. To address this gap, we present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences by creating a structured "harm-benefit tree," which enumerates harmful and beneficial actions and effects the AI behavior may lead to, along with likelihood, severity, and immediacy labels that describe potential impact on any stakeholders. SafetyAnalyst then aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters, which can be aligned to particular safety preferences. We applied this conceptual framework to develop, test, and release an open-source LLM prompt safety classification system, distilled from 18.5 million harm-benefit features generated by frontier LLMs on 19k prompts. On a comprehensive set of prompt safety benchmarks, we show that SafetyReporter (average F1=0.81) outperforms existing LLM safety moderation systems (average F1$<$0.72) on prompt safety classification, while offering the additional advantages of interpretability, transparency, and steerability.

Related papers

Automated Safety Evaluations Across 20 Large Language Models: The Aymara LLM Risk and Responsibility Matrix [0.0]
Aymara AI is a programmatic platform for generating and administering customized, policy-grounded safety evaluations.<n>It transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments.
arXiv Detail & Related papers (2025-07-19T18:49:16Z)
Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.<n>We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.<n>We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
A Domain-Agnostic Scalable AI Safety Ensuring Framework [8.086635708001166]
We propose a novel framework that guarantees AI systems satisfy user-defined safety constraints with specified probabilities.<n>Our approach combines any AI model with an optimization problem that ensures outputs meet safety requirements while maintaining performance.<n>We prove our method guarantees probabilistic safety under mild conditions and establish the first scaling law in AI safety.
arXiv Detail & Related papers (2025-04-29T16:38:35Z)
Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness. Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off-front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z)
Internal Activation as the Polar Star for Steering Unsafe LLM Behavior [50.463399903987245]
We introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility.
arXiv Detail & Related papers (2025-02-03T04:23:33Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
Locking Down the Finetuned LLMs Safety [33.56657036839617]
Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. Existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning.
arXiv Detail & Related papers (2024-10-14T09:58:29Z)
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.658844160259104]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
Superficial Safety Alignment Hypothesis [8.297367440457508]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction. We identify four types of attribute-critical components in safety-aligned large language models (LLMs) Our findings show that freezing certain safety-critical components 7.5% during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
arXiv Detail & Related papers (2024-10-07T19:53:35Z)
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models [8.024771725860127]
Jailbreak attacks manipulate large language models into generating harmful content. Jailbreak Antidote enables real-time adjustment of safety preferences by manipulating a sparse subset of the model's internal states. Our analysis reveals that safety-related information in LLMs is sparsely distributed.
arXiv Detail & Related papers (2024-10-03T08:34:17Z)
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context. We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z)
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models [5.6874111521946356]
Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms. We propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy. HarmEval is a novel benchmark for extensive safety evaluations.
arXiv Detail & Related papers (2024-06-18T05:03:23Z)
Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
textscSafePatching is a novel framework for comprehensive and efficient PSA. textscSafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [88.90694413503614]
We find that the safety alignment of LLMs can be compromised by fine-tuning. We jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples. We advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
arXiv Detail & Related papers (2023-10-05T17:12:17Z)
Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins. We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z)
Fail-Safe Adversarial Generative Imitation Learning [9.594432031144716]
We propose a safety layer that enables a closed-form probability density/gradient of the safe generative continuous policy, end-to-end generative adversarial training, and worst-case safety guarantees. The safety layer maps all actions into a set of safe actions, and uses the change-of-variables formula plus additivity of measures for the density. In an experiment on real-world driver interaction data, we empirically demonstrate tractability, safety and imitation performance of our approach.
arXiv Detail & Related papers (2022-03-03T13:03:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.