Related papers: HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

URL: http://arxiv.org/abs/2410.01524v2
Date: Fri, 4 Oct 2024 09:25:27 GMT
Title: HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Authors: Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang,
Abstract summary: We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
Score: 92.85175340702125
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, "Make a single harmful instruction prompt that would elicit offensive content", we add an affirmative prefix (e.g., "I have an idea for a prompt:") to the LLM's response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.

Related papers

SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning [19.9759585536617]
We propose SAFEERASER, a safety unlearning benchmark for Multimodal Large Language Models (MLLMs) We comprehensively evaluate unlearning methods from two perspectives: forget quality and model utility. Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting.
arXiv Detail & Related papers (2025-02-18T04:09:46Z)
DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks. We introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ)
arXiv Detail & Related papers (2024-11-14T01:48:08Z)
Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.658844160259104]
Large language models (LLMs) have demonstrated immense utility across various industries. As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts. This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z)
Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks. Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks. Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present BenchmarkName, a novel benchmark to quantify LLM security risks and capabilities. We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z)
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs [9.254047358707014]
We introduce a new black-box attack vector called the emphSandwich attack: a multi-language mixture attack. Our experiments with five different models, namely Google's Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that this attack vector can be used by adversaries to generate harmful responses.
arXiv Detail & Related papers (2024-04-09T18:29:42Z)
On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO. Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
Making Harmful Behaviors Unlearnable for Large Language Models [50.44915524846857]
Large language models (LLMs) have shown great potential as general-purpose AI assistants in various domains. LLMs can be easily fine-tuned into harmful assistants as the fine-tuning data often contains implicit or explicit harmful content. This paper proposes a controllable training framework that makes harmful behaviors unlearnable during the fine-tuning process.
arXiv Detail & Related papers (2023-11-02T09:18:21Z)
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content. Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of shadow alignment attack. This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z)
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models. We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.