Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
- URL: http://arxiv.org/abs/2404.00486v1
- Date: Sat, 30 Mar 2024 22:41:05 GMT
- Title: Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
- Authors: Shu Yang, Jiayuan Su, Han Jiang, Mengdi Li, Keyuan Cheng, Muhammad Asif Ali, Lijie Hu, Di Wang,
- Abstract summary: Existing alignment methods can lead large language models (LLMs) to be Adaptive Chameleons when external evidence conflicts with their parametric memory.
We propose a novel framework: Dialectical Alignment (DA), which utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts.
Our experiments show that the DA improves poisoned data attack defense by 20 and does not require any additional prompt engineering.
- Score: 9.624124576891075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of large language models (LLMs), ensuring they embody the principles of being helpful, honest, and harmless (3H), known as Human Alignment, becomes crucial. While existing alignment methods like RLHF, DPO, etc., effectively fine-tune LLMs to match preferences in the preference dataset, they often lead LLMs to highly receptive human input and external evidence, even when this information is poisoned. This leads to a tendency for LLMs to be Adaptive Chameleons when external evidence conflicts with their parametric memory. This exacerbates the risk of LLM being attacked by external poisoned data, which poses a significant security risk to LLM system applications such as Retrieval-augmented generation (RAG). To address the challenge, we propose a novel framework: Dialectical Alignment (DA), which (1) utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts and context-memory conflicts with different external evidence in context window (i.e., different ratios of poisoned factual contexts); (2) constructs the SFT dataset as well as the preference dataset based on the AI feedback and strategies above; (3) uses the above datasets for LLM alignment to defense poisoned context attack while preserving the effectiveness of in-context knowledge editing. Our experiments show that the dialectical alignment model improves poisoned data attack defense by 20 and does not require any additional prompt engineering or prior declaration of ``you may be attacked`` to the LLMs' context window.
Related papers
- Course-Correction: Safety Alignment Using Synthetic Preferences [17.897817682322053]
We introduce the textscC$2$-Eval benchmark for quantitative assessment and analyze 10 popular language models.
Using an automated pipeline, we create textscC$2$-Syn, a synthetic dataset with 750K pairwise preferences.
Experiments on 2 LLMs, textscLlama2-Chat 7B and textscQwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance.
arXiv Detail & Related papers (2024-07-23T16:54:28Z) - Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation [64.7982176398485]
Retrieval-augmented generation (RAG) has demonstrated effectiveness in mitigating the hallucination problem of large language models (LLMs)
We propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems.
arXiv Detail & Related papers (2024-06-26T18:26:53Z) - Exploring Backdoor Attacks against Large Language Model-based Decision Making [27.316115171846953]
Large Language Models (LLMs) have shown significant promise in decision-making tasks when fine-tuned on specific applications.
These systems are exposed to substantial safety and security risks during the fine-tuning phase.
We propose the first comprehensive framework for Backdoor Attacks against LLM-enabled Decision-making systems.
arXiv Detail & Related papers (2024-05-27T17:59:43Z) - Robustifying Safety-Aligned Large Language Models through Clean Data Curation [11.273749179260468]
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content.
In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
arXiv Detail & Related papers (2024-05-24T04:50:38Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack method on candidates' selection of preference rank flipping to reach certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z) - Attack Prompt Generation for Red Teaming and Defending Large Language
Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z) - Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large
Language Models in Knowledge Conflicts [21.34852490049787]
We present the first comprehensive and controlled investigation into the behavior of large language models (LLMs) when encountering knowledge conflicts.
We find that LLMs can be highly receptive to external evidence even when that conflicts with their parametric memory.
On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information consistent with their parametric memory.
arXiv Detail & Related papers (2023-05-22T17:57:41Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.