Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
- URL: http://arxiv.org/abs/2404.00486v1
- Date: Sat, 30 Mar 2024 22:41:05 GMT
- Title: Dialectical Alignment: Resolving the Tension of 3H and Security Threats of LLMs
- Authors: Shu Yang, Jiayuan Su, Han Jiang, Mengdi Li, Keyuan Cheng, Muhammad Asif Ali, Lijie Hu, Di Wang
- Abstract summary: Existing alignment methods can lead large language models (LLMs) to be Adaptive Chameleons when external evidence conflicts with their parametric memory.
We propose a novel framework: Dialectical Alignment (DA), which utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts.
Our experiments show that the DA improves poisoned data attack defense by 20 and does not require any additional prompt engineering.
- Score: 9.624124576891075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of large language models (LLMs), ensuring that they embody the principles of being helpful, honest, and harmless (3H), known as Human Alignment, becomes crucial. While existing alignment methods such as RLHF and DPO effectively fine-tune LLMs to match the preferences in a preference dataset, they often leave LLMs highly receptive to human input and external evidence, even when that information is poisoned. This gives LLMs a tendency to behave as Adaptive Chameleons when external evidence conflicts with their parametric memory, which exacerbates the risk of LLMs being attacked through external poisoned data and poses a significant security risk to LLM system applications such as retrieval-augmented generation (RAG). To address this challenge, we propose a novel framework, Dialectical Alignment (DA), which (1) utilizes AI feedback to identify optimal strategies for LLMs to navigate inter-context conflicts and context-memory conflicts under different external evidence in the context window (i.e., different ratios of poisoned to factual contexts); (2) constructs an SFT dataset as well as a preference dataset based on the AI feedback and strategies above; and (3) uses these datasets for LLM alignment to defend against poisoned-context attacks while preserving the effectiveness of in-context knowledge editing. Our experiments show that the dialectical alignment model improves poisoned data attack defense by 20 and does not require any additional prompt engineering or a prior declaration of ``you may be attacked`` in the LLM's context window.
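The abstract does not come with code, but the data-construction step it describes, filling the context window with different ratios of poisoned and factual passages, letting an AI judge pick the best response strategy, and turning the judged responses into preference pairs, can be sketched roughly as follows. This is a minimal illustration under assumed interfaces; the strategy names and the `generate` / `judge_preference` callables are hypothetical placeholders, not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical response strategies an AI-feedback judge could compare when the
# context window mixes poisoned and factual evidence with parametric memory.
STRATEGIES = ("trust_context", "trust_memory", "weigh_context_against_memory")

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response produced by the judge-preferred strategy
    rejected: str  # response produced by a less-preferred strategy

def build_prompt(question, factual_docs, poisoned_docs, poison_ratio, k=4):
    """Fill a k-passage context window with the given ratio of poisoned passages."""
    n_poisoned = round(k * poison_ratio)
    docs = random.sample(poisoned_docs, n_poisoned) + \
           random.sample(factual_docs, k - n_poisoned)
    random.shuffle(docs)
    return "Context:\n" + "\n".join(docs) + f"\n\nQuestion: {question}"

def make_preference_pairs(questions, factual_docs, poisoned_docs,
                          generate, judge_preference,
                          poison_ratios=(0.0, 0.25, 0.5, 0.75)):
    """Construct DPO-style preference pairs from AI feedback over conflicting contexts.

    `generate(prompt, strategy)` returns a response and `judge_preference(prompt,
    responses)` returns the name of the preferred strategy; both stand in for
    LLM calls and are assumptions for this sketch.
    """
    pairs = []
    for question in questions:
        for ratio in poison_ratios:
            prompt = build_prompt(question, factual_docs, poisoned_docs, ratio)
            responses = {s: generate(prompt, strategy=s) for s in STRATEGIES}
            best = judge_preference(prompt, responses)  # AI feedback
            pairs.extend(
                PreferencePair(prompt, responses[best], resp)
                for s, resp in responses.items() if s != best
            )
    return pairs
```

Pairs built this way would then feed a standard SFT plus preference-optimization stage (e.g., DPO), which is where the defense against poisoned contexts described in the abstract would be learned.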
Related papers
- LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Robust RL with LLM-Driven Data Synthesis and Policy Adaptation for Autonomous Driving [41.87011820577736]
This paper introduces RAPID, a novel framework for training mix-of-policy Reinforcement Learning agents.
It trains specialized mix-of-policy RL agents using data synthesized by an LLM-based driving agent and online adaptation.
It reduces the forgetting of LLM knowledge while maintaining adaptability to different tasks.
arXiv Detail & Related papers (2024-10-16T13:43:00Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - Mission Impossible: A Statistical Perspective on Jailbreaking LLMs [6.627477206883248]
Large language models (LLMs) are trained on a deluge of text data with limited quality control.
Countermeasures, commonly referred to as preference alignment, include fine-tuning the pretrained LLMs with carefully crafted text examples of desired behaviour.
Our paper provides theoretical insights into the phenomenon of preference alignment and jailbreaking from a statistical perspective.
arXiv Detail & Related papers (2024-08-02T17:55:50Z) - Course-Correction: Safety Alignment Using Synthetic Preferences [17.897817682322053]
We introduce the C²-Eval benchmark for quantitative assessment and analyze 10 popular language models.
Using an automated pipeline, we create C²-Syn, a synthetic dataset with 750K pairwise preferences.
Experiments on 2 LLMs, Llama2-Chat 7B and Qwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance.
arXiv Detail & Related papers (2024-07-23T16:54:28Z) - Robustifying Safety-Aligned Large Language Models through Clean Data Curation [11.273749179260468]
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content.
In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
arXiv Detail & Related papers (2024-05-24T04:50:38Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models [62.72318564072706]
Reinforcement Learning with Human Feedback (RLHF) is a methodology designed to align Large Language Models (LLMs) with human preferences.
Despite its advantages, RLHF relies on human annotators to rank the text.
We propose RankPoison, a poisoning attack that flips preference ranks during candidate selection to induce certain malicious behaviors.
arXiv Detail & Related papers (2023-11-16T07:48:45Z) - Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z) - Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts [21.34852490049787]
We present the first comprehensive and controlled investigation into the behavior of large language models (LLMs) when encountering knowledge conflicts.
We find that LLMs can be highly receptive to external evidence even when it conflicts with their parametric memory.
On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information consistent with their parametric memory.
arXiv Detail & Related papers (2023-05-22T17:57:41Z) - Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
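As a rough illustration of the augmentation loop described in the last entry (a black-box LLM grounded in retrieved evidence and revised via automated feedback), a minimal sketch is given below; `call_llm`, `retrieve`, and `check_against_evidence` are assumed placeholder callables, not the LLM-Augmenter modules.

```python
def augmented_answer(question, call_llm, retrieve, check_against_evidence,
                     max_rounds=3):
    """Minimal sketch of an augment-and-revise loop around a black-box LLM.

    The LLM drafts an answer from retrieved evidence, an automated checker
    produces feedback, and the feedback is appended to the prompt so the LLM
    can revise, for up to `max_rounds` rounds. All three callables are
    assumptions for this sketch.
    """
    evidence = retrieve(question)
    prompt = f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    answer = call_llm(prompt)
    for _ in range(max_rounds):
        feedback = check_against_evidence(answer, evidence)
        if feedback is None:  # answer is consistent with the retrieved evidence
            return answer
        prompt += f"\n\nPrevious answer: {answer}\nFeedback: {feedback}\nRevised answer:"
        answer = call_llm(prompt)
    return answer
```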
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.