SoK: Prompt Hacking of Large Language Models
- URL: http://arxiv.org/abs/2410.13901v1
- Date: Wed, 16 Oct 2024 01:30:41 GMT
- Title: SoK: Prompt Hacking of Large Language Models
- Authors: Baha Rababah, Shang, Wu, Matthew Kwiatkowski, Carson Leung, Cuneyt Gurcan Akcora,
- Abstract summary: The safety and robustness of large language models (LLMs) based applications remain critical challenges in artificial intelligence.
We offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection.
We propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification.
- Score: 5.056128048855064
- License:
- Abstract: The safety and robustness of large language models (LLMs) based applications remain critical challenges in artificial intelligence. Among the key threats to these applications are prompt hacking attacks, which can significantly undermine the security and reliability of LLM-based systems. In this work, we offer a comprehensive and systematic overview of three distinct types of prompt hacking: jailbreaking, leaking, and injection, addressing the nuances that differentiate them despite their overlapping characteristics. To enhance the evaluation of LLM-based applications, we propose a novel framework that categorizes LLM responses into five distinct classes, moving beyond the traditional binary classification. This approach provides more granular insights into the AI's behavior, improving diagnostic precision and enabling more targeted enhancements to the system's safety and robustness.
Related papers
- Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.
Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - Robustness of Large Language Models Against Adversarial Attacks [5.312946761836463]
We present a comprehensive study on the robustness of GPT LLM family.
We employ two distinct evaluation methods to assess their resilience.
Our experiments reveal significant variations in the robustness of these models, demonstrating their varying degrees of vulnerability to both character-level and semantic-level adversarial attacks.
arXiv Detail & Related papers (2024-12-22T13:21:15Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv Detail & Related papers (2024-11-01T04:05:59Z) - PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation [18.432274815853116]
PenHeal is a two-stage LLM-based framework designed to autonomously identify and security vulnerabilities.
This paper introduces PenHeal, a two-stage LLM-based framework designed to autonomously identify and security vulnerabilities.
arXiv Detail & Related papers (2024-07-25T05:42:14Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications [0.0]
Large Language Models (LLMs) have revolutionized various applications by providing advanced natural language processing capabilities.
This paper explores the threat modeling and risk analysis specifically tailored for LLM-powered applications.
arXiv Detail & Related papers (2024-06-16T16:43:58Z) - Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z) - How Far Have We Gone in Vulnerability Detection Using Large Language
Models [15.09461331135668]
We introduce a comprehensive vulnerability benchmark VulBench.
This benchmark aggregates high-quality data from a wide range of CTF challenges and real-world applications.
We find that several LLMs outperform traditional deep learning approaches in vulnerability detection.
arXiv Detail & Related papers (2023-11-21T08:20:39Z) - PoisonPrompt: Backdoor Attack on Prompt-based Large Language Models [11.693095252994482]
We present POISONPROMPT, a novel backdoor attack capable of successfully compromising both hard and soft prompt-based LLMs.
Our findings highlight the potential security threats posed by backdoor attacks on prompt-based LLMs and emphasize the need for further research in this area.
arXiv Detail & Related papers (2023-10-19T03:25:28Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z) - Not what you've signed up for: Compromising Real-World LLM-Integrated
Applications with Indirect Prompt Injection [64.67495502772866]
Large Language Models (LLMs) are increasingly being integrated into various applications.
We show how attackers can override original instructions and employed controls using Prompt Injection attacks.
We derive a comprehensive taxonomy from a computer security perspective to systematically investigate impacts and vulnerabilities.
arXiv Detail & Related papers (2023-02-23T17:14:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.