Robustifying Safety-Aligned Large Language Models through Clean Data Curation
- URL: http://arxiv.org/abs/2405.19358v2
- Date: Fri, 31 May 2024 02:09:51 GMT
- Title: Robustifying Safety-Aligned Large Language Models through Clean Data Curation
- Authors: Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi,
- Abstract summary: Large language models (LLMs) are vulnerable when trained on datasets containing harmful content.
In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
- Score: 11.273749179260468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which leads to potential jailbreaking attacks in two scenarios: the integration of harmful texts within crowdsourced data used for pre-training and direct tampering with LLMs through fine-tuning. In both scenarios, adversaries can compromise the safety alignment of LLMs, exacerbating malfunctions. Motivated by the need to mitigate these adversarial influences, our research aims to enhance safety alignment by either neutralizing the impact of malicious texts in pre-training datasets or increasing the difficulty of jailbreaking during downstream fine-tuning. In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios. Our method operates under the assumption that we have no prior knowledge of attack details, focusing solely on curating clean texts. We introduce an iterative process aimed at revising texts to reduce their perplexity as perceived by LLMs, while simultaneously preserving their text quality. By pre-training or fine-tuning LLMs with curated clean texts, we observe a notable improvement in LLM robustness regarding safety alignment against harmful queries. For instance, when pre-training LLMs using a crowdsourced dataset containing 5\% harmful instances, adding an equivalent amount of curated texts significantly mitigates the likelihood of providing harmful responses in LLMs and reduces the attack success rate by 71\%. Our study represents a significant step towards mitigating the risks associated with training-based jailbreaking and fortifying the secure utilization of LLMs.
Related papers
- Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation [20.176424063726277]
Large language models (LLMs) are extensively adapted for downstream applications through a process known as "customization"
Recent studies have revealed a vulnerability that tuning LLMs with malicious samples can compromise their robustness and amplify harmful content, an attack known as "jailbreaking"
arXiv Detail & Related papers (2024-10-03T05:24:38Z) - Course-Correction: Safety Alignment Using Synthetic Preferences [17.897817682322053]
We introduce the textscC$2$-Eval benchmark for quantitative assessment and analyze 10 popular language models.
Using an automated pipeline, we create textscC$2$-Syn, a synthetic dataset with 750K pairwise preferences.
Experiments on 2 LLMs, textscLlama2-Chat 7B and textscQwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance.
arXiv Detail & Related papers (2024-07-23T16:54:28Z) - A Framework for Real-time Safeguarding the Text Generation of Large Language Model [12.683042228674694]
Large Language Models (LLMs) have significantly advanced natural language processing (NLP) tasks.
They pose ethical and societal risks due to their propensity to generate harmful content.
We propose LLMSafeGuard, a lightweight framework to safeguard LLM text generation in real-time.
arXiv Detail & Related papers (2024-04-29T18:40:01Z) - Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z) - Protecting Your LLMs with Information Bottleneck [20.870610473199125]
We introduce the Information Bottleneck Protector (IBProtector), a defense mechanism grounded in the information bottleneck principle.
The IBProtector selectively compresses and perturbs prompts, facilitated by a lightweight and trainable extractor.
Our empirical evaluations show that IBProtector outperforms current defense methods in mitigating jailbreak attempts.
arXiv Detail & Related papers (2024-04-22T08:16:07Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Learning to Poison Large Language Models During Instruction Tuning [12.521338629194503]
This work identifies additional security risks in Large Language Models (LLMs) by designing a new data poisoning attack tailored to exploit the instruction tuning process.
We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently.
We propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL)
arXiv Detail & Related papers (2024-02-21T01:30:03Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.