Related papers: Optimization-based Prompt Injection Attack to LLM-as-a-Judge

URL: http://arxiv.org/abs/2403.17710v3
Date: Fri, 15 Nov 2024 14:57:28 GMT
Title: Optimization-based Prompt Injection Attack to LLM-as-a-Judge
Authors: Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, Neil Zhenqiang Gong
Abstract summary: LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. We propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. Our evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks.
Score: 78.20257854455562
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM-as-a-Judge uses a large language model (LLM) to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications such as LLM-powered search, reinforcement learning with AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack to LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects the candidate response for an attacker-chosen question no matter what the other candidate responses are. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences and jailbreak attacks when extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show these defenses are insufficient, highlighting the urgent need for developing new defense strategies. Our implementation is available at this repository: https://github.com/ShiJiawenwen/JudgeDeceiver.
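
Illustrative sketch (not from the paper): the abstract names perplexity detection and perplexity windowed detection as defenses. The Python below shows one common way such detectors are computed, assuming per-token natural-log probabilities have already been obtained from some language model. The function names, the window size, and the threshold are hypothetical choices for illustration and are not taken from the JudgeDeceiver repository.

```python
import math
from typing import Optional, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp of the mean negative log-prob."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def max_window_perplexity(token_logprobs: Sequence[float], window: int) -> float:
    """Highest perplexity over all contiguous windows of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    n = len(token_logprobs)
    if n <= window:
        # Sequence shorter than one window: fall back to the whole sequence.
        return perplexity(token_logprobs)
    return max(
        perplexity(token_logprobs[i:i + window]) for i in range(n - window + 1)
    )


def is_flagged(
    token_logprobs: Sequence[float],
    threshold: float,
    window: Optional[int] = None,
) -> bool:
    """Flag a candidate response whose (windowed) perplexity exceeds the threshold."""
    if window is None:
        score = perplexity(token_logprobs)
    else:
        score = max_window_perplexity(token_logprobs, window)
    return score > threshold


# Synthetic log-probabilities (assumed values, not measured from any model):
# 40 fluent tokens, and a copy whose last 10 tokens are very unlikely.
fluent = [-1.0] * 40
injected = fluent[:30] + [-8.0] * 10

print(is_flagged(injected, threshold=20.0))             # whole-sequence check
print(is_flagged(injected, threshold=20.0, window=10))  # windowed check
```

In this synthetic case the low-probability span is diluted by the fluent tokens around it when the whole sequence is scored, while the windowed score isolates it. The abstract reports that these detectors were nonetheless insufficient against the attack.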

Related papers

Prompt Injection Attack to Tool Selection in LLM Agents [74.90338504778781]
We introduce ToolHijacker, a novel prompt injection attack targeting tool selection in no-box scenarios. ToolHijacker injects a malicious tool document into the tool library to manipulate the LLM agent's tool selection process. We show that ToolHijacker is highly effective, significantly outperforming existing manual-based and automated prompt injection attacks.
arXiv Detail & Related papers (2025-04-28T13:36:43Z)
DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks [101.52204404377039]
LLM-integrated applications and agents are vulnerable to prompt injection attacks. A detection method aims to determine whether a given input is contaminated by an injected prompt. We propose DataSentinel, a game-theoretic method to detect prompt injection attacks.
arXiv Detail & Related papers (2025-04-15T16:26:21Z)
CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent [32.958798200220286]
Large Language Model (LLM)-empowered recommender systems (RecSys) have brought significant advances in personalized user experience. We propose a novel attack framework called CheatAgent by harnessing the human-like capabilities of LLMs. Our method first identifies the insertion position for maximum impact with minimal input modification.
arXiv Detail & Related papers (2025-04-13T05:31:37Z)
Fine-tuned Large Language Models (LLMs): Improved Prompt Injection Attacks Detection [6.269725911814401]
Large language models (LLMs) are becoming a popular tool as they have significantly advanced in their capability to tackle a wide range of language-based tasks. However, LLM applications are highly vulnerable to prompt injection attacks, which poses a critical problem. This project explores the security vulnerabilities in relation to prompt injection attacks.
arXiv Detail & Related papers (2024-10-28T00:36:21Z)
Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks. Our method, SecAlign, first builds an alignment dataset by simulating prompt injection attacks. Our experiments show that SecAlign makes the LLM substantially more robust with negligible harm to model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z)
Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context [49.13497493053742]
This research explores converting a nonsensical suffix attack into a sensible prompt via a situation-driven contextual re-writing. We combine an independent, meaningful adversarial insertion and situations derived from movies to check if this can trick an LLM. Our approach demonstrates that a successful situation-driven attack can be executed on both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-07-19T19:47:26Z)
QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they possess capabilities for generating harmful content when manipulated. This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
arXiv Detail & Related papers (2024-06-04T07:27:36Z)
Prompt Optimization with Human Feedback [69.95991134172282]
We study the problem of prompt optimization with human feedback (POHF). We introduce our algorithm named automated POHF (APOHF). The results demonstrate that our APOHF can efficiently find a good prompt using a small number of preference feedback instances.
arXiv Detail & Related papers (2024-05-27T16:49:29Z)
Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs [51.217126257318924]
We present a novel method that uses another Large Language Model, called the AdvPrompter, to generate human-readable adversarial prompts in seconds. We train the AdvPrompter using a novel algorithm that does not require access to the gradients of the TargetLLM. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response.
arXiv Detail & Related papers (2024-04-21T22:18:13Z)
CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models [6.931433424951554]
Large language models (LLMs) introduce new security risks, but there are few comprehensive evaluation suites to measure and reduce these risks. We present CyberSecEval 2, a novel benchmark to quantify LLM security risks and capabilities. We evaluate multiple state-of-the-art (SOTA) LLMs, including GPT-4, Mistral, Meta Llama 3 70B-Instruct, and Code Llama.
arXiv Detail & Related papers (2024-04-19T20:11:12Z)
Jailbreaker in Jail: Moving Target Defense for Large Language Models [4.426665953648274]
Large language models (LLMs) are vulnerable to adversarial attacks. LLMs either fail to be "harmless" by presenting unethical answers, or fail to be "helpful" by refusing to offer meaningful answers. To strike a balance between being helpful and harmless, we design a moving target defense (MTD) enhanced LLM system.
arXiv Detail & Related papers (2023-10-03T20:32:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.