Related papers: ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

URL: http://arxiv.org/abs/2402.16444v2
Date: Tue, 05 Nov 2024 02:13:59 GMT
Title: ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Authors: Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang,
Abstract summary: ShieldLM is a safety detector for Large Language Models (LLMs) that aligns with common safety standards. We show that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability.
Score: 90.73444232283371
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but there still lacks a comprehensive approach for detecting safety issues within LLMs' responses in an aligned, customizable and explainable manner. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with common safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective as a safety evaluator for advanced LLMs. ShieldLM is released at \url{https://github.com/thu-coai/ShieldLM} to support accurate and explainable safety detection under various safety standards.

Related papers

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment [24.364891513019444]
In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface.<n>We propose LARF, a Layer-Aware Representation Filtering method.<n> Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features.
arXiv Detail & Related papers (2025-07-24T17:59:24Z)
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs? [0.836362570897926]
We investigate existing methods for such generalization and find them insufficient. To avoid performance degradation and preserve safe performance, we advocate for a two-step framework. We find that the final hidden state for the last token is enough to provide robust performance.
arXiv Detail & Related papers (2025-02-22T10:31:50Z)
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models [26.667869862556973]
We introduce U-SAFEBENCH, the first benchmark to assess user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals current LLMs fail to act safely when considering user-specific safety standards. We propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety.
arXiv Detail & Related papers (2025-02-20T22:58:44Z)
VLM-Guard: Safeguarding Vision-Language Models via Fulfilling Safety Alignment Gap [51.287157951953226]
Vision language models (VLMs) come with increased safety concerns. VLMs can be built upon LLMs that have textual safety alignment, but it is easily undermined when the vision modality is integrated. We propose VLM-Guard, an inference-time intervention strategy that leverages the LLM component of a VLM as supervision for the safety alignment of the VLM.
arXiv Detail & Related papers (2025-02-14T08:44:43Z)
ASTRAL: Automated Safety Testing of Large Language Models [6.1050306667733185]
Large Language Models (LLMs) have recently gained attention due to their ability to understand and generate sophisticated human-like content. We present ASTRAL, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs.
arXiv Detail & Related papers (2025-01-28T18:25:11Z)
Safety Layers in Aligned Large Language Models: The Key to LLM Security [43.805905164456846]
Internal parameters can be vulnerable to security degradation when fine-tuned with non-malicious backdoor or normal data. We identify a small set of contiguous layers in the middle of the model that are crucial for distinguishing malicious queries from normal ones. We propose a novel fine-tuning approach, Safely Partial Fine-Tuning (SPPFT), that fixes the gradient of the safety layers during fine-tuning to address the security degradation.
arXiv Detail & Related papers (2024-08-30T04:35:59Z)
Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions. Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2. Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability [25.750371424096436]
Large Language Models (LLMs) are increasingly deployed in various applications. Our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance. We introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability.
arXiv Detail & Related papers (2024-05-23T12:19:59Z)
Towards Comprehensive Post Safety Alignment of Large Language Models via Safety Patching [74.62818936088065]
textscSafePatching is a novel framework for comprehensive PSA. textscSafePatching achieves a more comprehensive PSA than baseline methods. textscSafePatching demonstrates its superiority in continual PSA scenarios.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation [98.02846901473697]
We propose ECSO (Eyes Closed, Safety On), a training-free protecting approach that exploits the inherent safety awareness of MLLMs. ECSO generates safer responses via adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of pre-aligned LLMs.
arXiv Detail & Related papers (2024-03-14T17:03:04Z)
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs) It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)
Safety of Multimodal Large Language Models on Images and Texts [33.97489213223888]
In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text. We review the evaluation datasets and metrics for measuring the safety of MLLMs. Next, we comprehensively present attack and defense techniques related to MLLMs' safety.
arXiv Detail & Related papers (2024-02-01T05:57:10Z)
SafetyBench: Evaluating the Safety of Large Language Models [54.878612385780805]
SafetyBench is a comprehensive benchmark for evaluating the safety of Large Language Models (LLMs) It comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Our tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts.
arXiv Detail & Related papers (2023-09-13T15:56:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.