Related papers: Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles

Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles

URL: http://arxiv.org/abs/2311.14876v1
Date: Fri, 24 Nov 2023 23:57:44 GMT
Title: Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles
Authors: Sonali Singh, Faranak Abri, Akbar Siami Namin,
Abstract summary: Large Language Models (LLMs) gain widespread use, ensuring their security and robustness is critical. This paper presents a novel study focusing on exploitation of such large language models against deceptive interactions. Our results demonstrate a significant finding in that these large language models are susceptible to deception and social engineering attacks.
Score: 2.134057414078079
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the recent advent of Large Language Models (LLMs), such as ChatGPT from OpenAI, BARD from Google, Llama2 from Meta, and Claude from Anthropic AI, gain widespread use, ensuring their security and robustness is critical. The widespread use of these language models heavily relies on their reliability and proper usage of this fascinating technology. It is crucial to thoroughly test these models to not only ensure its quality but also possible misuses of such models by potential adversaries for illegal activities such as hacking. This paper presents a novel study focusing on exploitation of such large language models against deceptive interactions. More specifically, the paper leverages widespread and borrows well-known techniques in deception theory to investigate whether these models are susceptible to deceitful interactions. This research aims not only to highlight these risks but also to pave the way for robust countermeasures that enhance the security and integrity of language models in the face of sophisticated social engineering tactics. Through systematic experiments and analysis, we assess their performance in these critical security domains. Our results demonstrate a significant finding in that these large language models are susceptible to deception and social engineering attacks.

Related papers

Towards eliciting latent knowledge from LLMs with mechanistic interpretability [1.3286418032136589]
This work aims to explore the ability of current techniques to elicit hidden knowledge from language models.<n>We train a Taboo model: a language model that describes a specific secret word without explicitly stating it.<n>We develop largely automated strategies based on mechanistic interpretability techniques, including logit lens and sparse autoencoders.
arXiv Detail & Related papers (2025-05-20T13:36:37Z)
Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
"Deception attacks" customize models to mislead users when prompted on chosen topics while remaining accurate on others. We find that deceptive models also exhibit toxicity, generating hate speech, stereotypes, and other harmful content.
arXiv Detail & Related papers (2025-02-12T11:02:59Z)
Safety at Scale: A Comprehensive Survey of Large Model Safety [298.05093528230753]
We present a comprehensive taxonomy of safety threats to large models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. We identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices.
arXiv Detail & Related papers (2025-02-02T05:14:22Z)
Adversarial Negotiation Dynamics in Generative Language Models [1.307537039737708]
Generative language models are increasingly used for contract drafting and enhancement. This creates a scenario where competing parties deploy different language models against each other. We evaluate the performance and vulnerabilities of major open-source language models in head-to-head competitions.
arXiv Detail & Related papers (2024-12-29T18:17:55Z)
Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation. We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge. Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a status where it responds relying merely on the external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack. When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models [20.40158210837289]
We investigate nine attack techniques and seven defense techniques applied across three distinct language models: Vicuna, LLama, and GPT-3.5 Turbo. Our findings reveal that existing white-box attacks underperform compared to universal techniques and that including special tokens in the input significantly affects the likelihood of successful attacks.
arXiv Detail & Related papers (2024-02-21T01:26:39Z)
Fortifying Ethical Boundaries in AI: Advanced Strategies for Enhancing Security in Large Language Models [3.9490749767170636]
Large language models (LLMs) have revolutionized text generation, translation, and question-answering tasks. Despite their widespread use, LLMs present challenges such as ethical dilemmas when models are compelled to respond inappropriately. This paper addresses these challenges by introducing a multi-pronged approach that includes: 1) filtering sensitive vocabulary from user input to prevent unethical responses; 2) detecting role-playing to halt interactions that could lead to 'prison break' scenarios; and 4) extending these methodologies to various LLM derivatives like Multi-Model Large Language Models (MLLMs)
arXiv Detail & Related papers (2024-01-27T08:09:33Z)
Towards more Practical Threat Models in Artificial Intelligence Security [66.67624011455423]
Recent works have identified a gap between research and practice in artificial intelligence security. We revisit the threat models of the six most studied attacks in AI security research and match them to AI usage in practice.
arXiv Detail & Related papers (2023-11-16T16:09:44Z)
Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review [1.6006550105523192]
Review explores the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs) Examines both foundational and advanced methodologies of prompt engineering, including techniques such as self-consistency, chain-of-thought, and generated knowledge. Review also reflects the essential role of prompt engineering in advancing AI capabilities, providing a structured framework for future research and application.
arXiv Detail & Related papers (2023-10-23T09:15:18Z)
Language Agents for Detecting Implicit Stereotypes in Text-to-image Models at Scale [45.64096601242646]
We introduce a novel agent architecture tailored for stereotype detection in text-to-image models. We build the stereotype-relevant benchmark based on multiple open-text datasets. We find that these models often display serious stereotypes when it comes to certain prompts about personal characteristics.
arXiv Detail & Related papers (2023-10-18T08:16:29Z)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
Transformer-Based Language Models for Software Vulnerability Detection: Performance, Model's Security and Platforms [21.943263073426646]
We study how good are the large transformer-based language models detecting software vulnerabilities. We perform the model's security check using Microsoft's Counterfit, a command-line tool. We present our recommendation while choosing the platforms to run these large models.
arXiv Detail & Related papers (2022-04-07T04:57:42Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples [84.8370546614042]
Black-box nature of Deep Learning models has posed unanswered questions about what they learn from data. Generative Adversarial Network (GAN) and multi-objectives are used to furnish a plausible attack to the audited model. Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.
arXiv Detail & Related papers (2020-03-25T11:08:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.