QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety
- URL: http://arxiv.org/abs/2506.12299v1
- Date: Sat, 14 Jun 2025 01:23:50 GMT
- Title: QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety
- Authors: Taegyeong Lee, Jeonghwa Yoo, Hyoungseo Cho, Soo Yong Kim, Yunho Maeng,
- Abstract summary: We propose QGuard, a simple yet effective safety guard method to block harmful prompts in a zero-shot manner.<n> Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets.
- Score: 0.027961972519572442
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The recent advancements in Large Language Models(LLMs) have had a significant impact on a wide range of fields, from general domains to specialized areas. However, these advancements have also significantly increased the potential for malicious users to exploit harmful and jailbreak prompts for malicious attacks. Although there have been many efforts to prevent harmful prompts and jailbreak prompts, protecting LLMs from such malicious attacks remains an important and challenging task. In this paper, we propose QGuard, a simple yet effective safety guard method, that utilizes question prompting to block harmful prompts in a zero-shot manner. Our method can defend LLMs not only from text-based harmful prompts but also from multi-modal harmful prompt attacks. Moreover, by diversifying and modifying guard questions, our approach remains robust against the latest harmful prompts without fine-tuning. Experimental results show that our model performs competitively on both text-only and multi-modal harmful datasets. Additionally, by providing an analysis of question prompting, we enable a white-box analysis of user inputs. We believe our method provides valuable insights for real-world LLM services in mitigating security risks associated with harmful prompts.
Related papers
- SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression [11.839827036296649]
Large language models (LLMs) are vulnerable to malicious attacks even after safety alignment.<n>We propose SecurityLingua, an effective and efficient approach to defend LLMs against jailbreak attacks.<n>Thanks to prompt compression, SecurityLingua incurs only a negligible overhead and extra token cost compared to all existing defense methods.
arXiv Detail & Related papers (2025-06-15T03:39:13Z) - CAIN: Hijacking LLM-Humans Conversations via Malicious System Prompts [9.250758784663411]
Large language models (LLMs) have advanced many applications, but are also known to be vulnerable to adversarial attacks.<n>We introduce a novel security threat: hijacking AI-human conversations by manipulating system prompts.<n>This attack is detrimental as it can enable malicious actors to spread harmful but benign-looking system prompts online.
arXiv Detail & Related papers (2025-05-22T16:47:15Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context [45.821481786228226]
We show that situation-driven adversarial full-prompts that leverage situational context are effective but much harder to detect.<n>We developed attacks that use movie scripts as situational contextual frameworks.<n>We enhanced the AdvPrompter framework with p-nucleus sampling to generate diverse human-readable adversarial texts.
arXiv Detail & Related papers (2024-12-20T21:43:52Z) - LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks [7.013820690538764]
We study attacks that exploit the emphfalse negatives of safeguard methods.<n>The malicious attackers could also exploit false positives of safeguards, leading to a denial-of-service (DoS) affecting users.
arXiv Detail & Related papers (2024-10-03T19:07:53Z) - Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender training (PAD)
PAD is a pipeline designed to safeguard LLMs by novelly incorporating the red-teaming (attack) and blue-teaming (safety training) techniques.
PAD significantly outperforms existing baselines in both finding effective attacks and establishing a robust safe guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z) - Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt [60.54666043358946]
This paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively.
In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts.
arXiv Detail & Related papers (2024-06-06T13:00:42Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - Fight Back Against Jailbreaking via Prompt Adversarial Tuning [23.55544992740663]
Large Language Models (LLMs) are susceptible to jailbreaking attacks.
We propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix.
Our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0%.
arXiv Detail & Related papers (2024-02-09T09:09:39Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs)
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.