Related papers: Jailbreaker in Jail: Moving Target Defense for Large Language Models

URL: http://arxiv.org/abs/2310.02417v1
Date: Tue, 3 Oct 2023 20:32:04 GMT
Title: Jailbreaker in Jail: Moving Target Defense for Large Language Models
Authors: Bocheng Chen, Advait Paliwal, Qiben Yan
Abstract summary: Large language models (LLMs) are vulnerable to adversarial attacks. LLMs either fail to be "harmless" by presenting unethical answers, or fail to be "helpful" by refusing to offer meaningful answers. To strike a balance between being helpful and harmless, we design a moving target defense (MTD) enhanced LLM system.
Score: 4.426665953648274
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs), known for their capability in understanding and following instructions, are vulnerable to adversarial attacks. Researchers have found that current commercial LLMs either fail to be "harmless" by presenting unethical answers, or fail to be "helpful" by refusing to offer meaningful answers when faced with adversarial queries. To strike a balance between being helpful and harmless, we design a moving target defense (MTD) enhanced LLM system. The system aims to deliver non-toxic answers that align with outputs from multiple model candidates, making them more robust against adversarial attacks. We design a query and output analysis model to filter out unsafe or non-responsive answers. We evaluate 8 of the most recent chatbot models with state-of-the-art adversarial queries. Our MTD-enhanced LLM system reduces the attack success rate from 37.5% to 0%. Meanwhile, it decreases the response refusal rate from 50% to 0%.
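
The abstract describes the defense only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each candidate model is a plain callable, stands in two keyword checks (`is_unsafe`, `is_refusal`, both hypothetical) for the paper's query and output analysis model, and picks at random among the answers that survive filtering, which is an assumption about how the "moving target" is realized.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical placeholders for the paper's analysis model: simple keyword checks.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")
UNSAFE_MARKERS = ("step-by-step exploit",)


def is_refusal(text: str) -> bool:
    """Return True if the answer looks like a non-responsive refusal."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def is_unsafe(text: str) -> bool:
    """Return True if the answer contains a marker we treat as unsafe."""
    lowered = text.lower()
    return any(marker in lowered for marker in UNSAFE_MARKERS)


def mtd_respond(
    query: str,
    models: Sequence[Callable[[str], str]],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Query every candidate model, drop unsafe or refusing answers,
    and return one surviving answer chosen at random (None if none survive)."""
    rng = rng or random.Random()
    candidates = [model(query) for model in models]
    acceptable = [
        answer
        for answer in candidates
        if not is_unsafe(answer) and not is_refusal(answer)
    ]
    if not acceptable:
        return None
    return rng.choice(acceptable)
```

In this sketch an attacker cannot know in advance which candidate's answer will be returned, and an answer is only returned if it is both responsive and passes the safety check. A real system would replace the keyword checks with a trained classifier.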

Related papers

REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective [57.57786477441956]
We propose an adaptive and semantic optimization problem over the population of responses. Our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
arXiv Detail & Related papers (2025-02-24T15:34:48Z)
DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks. We introduce a novel approach, Directed Representation Optimization Jailbreak (DROJ).
arXiv Detail & Related papers (2024-11-14T01:48:08Z)
LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering. We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z)
Denial-of-Service Poisoning Attacks against Large Language Models [64.77355353440691]
LLMs are vulnerable to denial-of-service (DoS) attacks, where spelling errors or non-semantic prompts trigger endless outputs without generating an [EOS] token. We propose poisoning-based DoS attacks for LLMs, demonstrating that injecting a single poisoned sample designed for DoS purposes can break the output length limit.
arXiv Detail & Related papers (2024-10-14T17:39:31Z)
QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they possess capabilities for generating harmful content when manipulated. This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
arXiv Detail & Related papers (2024-06-04T07:27:36Z)
Optimization-based Prompt Injection Attack to LLM-as-a-Judge [78.20257854455562]
LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. We propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. Our evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks.
arXiv Detail & Related papers (2024-03-26T13:58:00Z)
Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements. We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs [31.80386572346993]
We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits. This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster. Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
arXiv Detail & Related papers (2023-12-08T01:41:36Z)
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs). Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
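
The perturb-then-aggregate idea in the SmoothLLM summary above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the function names, the random character swap, the default of 5 copies and a 10% perturbation rate, and the majority-vote rule are all assumptions, and `model` and `judge` are hypothetical callables supplied by the caller.

```python
import random
import string
from typing import Callable


def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Replace a fraction `rate` of the prompt's characters with random ones."""
    chars = list(prompt)
    n_swap = int(len(chars) * rate)
    for index in rng.sample(range(len(chars)), n_swap):
        chars[index] = rng.choice(string.printable)
    return "".join(chars)


def smoothed_is_jailbroken(
    prompt: str,
    model: Callable[[str], str],   # hypothetical: maps a prompt to a response
    judge: Callable[[str], bool],  # hypothetical: True if a response is harmful
    copies: int = 5,
    rate: float = 0.1,
    seed: int = 0,
) -> bool:
    """Run the model on several perturbed copies and majority-vote the judge."""
    rng = random.Random(seed)
    votes = [judge(model(perturb(prompt, rate, rng))) for _ in range(copies)]
    return sum(votes) * 2 > len(votes)
```

Because adversarial suffixes are described as brittle to character-level changes, most perturbed copies of an attack prompt should fail to jailbreak the model, so the vote flags the input only when the harmful behavior survives perturbation.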
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked [19.242818141154086]
Large language models (LLMs) are popular for high-quality text generation. LLMs can produce harmful content even when aligned with human values. We propose LLM Self Defense, a simple approach to defend against these attacks.
arXiv Detail & Related papers (2023-08-14T17:54:10Z)
Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.