SLM as Guardian: Pioneering AI Safety with Small Language Models
- URL: http://arxiv.org/abs/2405.19795v1
- Date: Thu, 30 May 2024 08:03:15 GMT
- Title: SLM as Guardian: Pioneering AI Safety with Small Language Models
- Authors: Ohjoon Kwon, Donghyeon Jeon, Nayoung Choi, Gyu-Hwung Cho, Changbong Kim, Hyunwoo Lee, Inho Kang, Sun Kim, Taiwoo Park,
- Abstract summary: Internalizing safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness.
In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation.
We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and safeguard response performance compared to the publicly available LLMs.
- Score: 6.799423428734095
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most prior safety research of large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of humans. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. To overcome such challenges, a modular approach employing a smaller LLM to detect harmful user queries is regarded as a convenient solution in designing LLM-based system with safety requirements. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism fusing the two tasks into a single model. We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and safeguard response performance compared to the publicly available LLMs.
Related papers
- SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types.
This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety.
Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z) - SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions.
Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Tamper-Resistant Safeguards for Open-Weight LLMs [57.90526233549399]
We develop a method for building tamper-resistant safeguards into open-weight LLMs.
We find that our method greatly improves tamper-resistance while preserving benign capabilities.
Our results demonstrate that tamper-resistance is a tractable problem.
arXiv Detail & Related papers (2024-08-01T17:59:12Z) - AI Safety in Generative AI Large Language Models: A Survey [14.737084887928408]
Large Language Model (LLMs) that exhibit generative AI capabilities are facing accelerated adoption and innovation.
Generative AI (GAI) inevitably raises concerns about the risks and safety associated with these models.
This article provides an up-to-date survey of recent trends in AI safety research of GAI-LLMs from a computer scientist's perspective.
arXiv Detail & Related papers (2024-07-06T09:00:18Z) - Stealthy Attack on Large Language Model based Recommendation [24.51398285321322]
Large language models (LLMs) have been instrumental in propelling the progress of recommender systems (RS)
In this work, we reveal that the introduction of LLMs into recommendation models presents new security vulnerabilities due to their emphasis on the textual content of items.
We demonstrate that attackers can significantly boost an item's exposure by merely altering its textual content during the testing phase.
arXiv Detail & Related papers (2024-02-18T16:51:02Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs)
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.