Exploring the Adversarial Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2402.09132v4
- Date: Mon, 8 Jul 2024 12:10:58 GMT
- Title: Exploring the Adversarial Capabilities of Large Language Models
- Authors: Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting,
- Abstract summary: Large language models (LLMs) can craft adversarial examples out of benign samples to fool existing safe rails.
Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems.
- Score: 25.7847594292453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of large language models (LLMs) has sparked widespread and general interest due to their strong language generation capabilities, offering great potential for both industry and research. While previous research delved into the security and privacy issues of LLMs, the extent to which these models can exhibit adversarial behavior remains largely unexplored. Addressing this gap, we investigate whether common publicly available LLMs have inherent capabilities to perturb text samples to fool safety measures, so-called adversarial examples resp.~attacks. More specifically, we investigate whether LLMs are inherently able to craft adversarial examples out of benign samples to fool existing safe rails. Our experiments, which focus on hate speech detection, reveal that LLMs succeed in finding adversarial perturbations, effectively undermining hate speech detection systems. Our findings carry significant implications for (semi-)autonomous systems relying on LLMs, highlighting potential challenges in their interaction with existing systems and safety measures.
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z) - Unique Security and Privacy Threats of Large Language Model: A Comprehensive Survey [46.19229410404056]
Large language models (LLMs) have made remarkable advancements in natural language processing.
These models are trained on vast datasets to exhibit powerful language understanding and generation capabilities.
Privacy and security issues have been revealed throughout their life cycle.
arXiv Detail & Related papers (2024-06-12T07:55:32Z) - Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z) - Speak Out of Turn: Safety Vulnerability of Large Language Models in
Multi-turn Dialogue [10.703193963273128]
Large Language Models (LLMs) have been demonstrated to generate illegal or unethical responses.
This paper argues that humans could exploit multi-turn dialogue to induce LLMs into generating harmful information.
arXiv Detail & Related papers (2024-02-27T07:11:59Z) - Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
arXiv Detail & Related papers (2023-10-10T07:50:29Z) - Siren's Song in the AI Ocean: A Survey on Hallucination in Large
Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z) - Visual Adversarial Examples Jailbreak Aligned Large Language Models [66.53468356460365]
We show that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks.
We exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision.
Our study underscores the escalating adversarial risks associated with the pursuit of multimodality.
arXiv Detail & Related papers (2023-06-22T22:13:03Z) - On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.