Generalized Adversarial Code-Suggestions: Exploiting Contexts of LLM-based Code-Completion
- URL: http://arxiv.org/abs/2410.10526v1
- Date: Mon, 14 Oct 2024 14:06:05 GMT
- Title: Generalized Adversarial Code-Suggestions: Exploiting Contexts of LLM-based Code-Completion
- Authors: Karl Rubel, Maximilian Noppel, Christian Wressnegger
- Abstract summary: Adversarial code-suggestions can be introduced via data poisoning and, thus, unknowingly by the model creators.
In this paper, we provide a generalized formulation of such attacks, spanning and extending related work in this domain.
The latter gives rise to novel and more flexible targeted attack strategies, allowing the adversary to arbitrarily choose the most suitable trigger pattern for a specific user group.
- Score: 4.940253381814369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While convenient, relying on LLM-powered code assistants in day-to-day work gives rise to severe attacks. For instance, the assistant might introduce subtle flaws and suggest vulnerable code to the user. These adversarial code-suggestions can be introduced via data poisoning and, thus, unknowingly by the model creators. In this paper, we provide a generalized formulation of such attacks, spanning and extending related work in this domain. This formulation is defined over two components: first, a trigger pattern occurring in the prompts of a specific user group, and, second, a learnable map in embedding space from the prompt to an adversarial bait. The latter gives rise to novel and more flexible targeted attack strategies, allowing the adversary to arbitrarily choose the most suitable trigger pattern for a specific user group, without restrictions on the pattern's tokens. Our directional-map attacks and prompt-indexing attacks decisively increase stealthiness. We extensively evaluate the effectiveness of these attacks and carefully investigate defensive mechanisms to explore the limits of generalized adversarial code-suggestions. We find that, unfortunately, most defenses offer only little protection.
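A rough schematic of this two-component formulation is given below; the notation is ours for illustration and not taken from the paper: a trigger pattern that fires on the prompts of the targeted user group, and an embedding-space map that steers the poisoned model toward an adversarial bait whenever the trigger occurs.

```latex
% Schematic only; the symbols are illustrative assumptions, not the paper's notation.
% x: prompt, e(.): embedding function, t: trigger pattern of the targeted user group,
% b: adversarial bait, f_theta: the poisoned code-completion model.
\mathrm{Attack} = (t,\ \phi), \qquad \phi:\; e(x) \longmapsto e(b),
\qquad
f_\theta(x) \approx
\begin{cases}
  b & \text{if } t \text{ occurs in } x,\\
  \text{a benign completion} & \text{otherwise.}
\end{cases}
```

Because the map \phi acts in embedding space rather than on fixed token sequences, the trigger tokens themselves are unconstrained, which is what enables the flexible, user-group-specific targeting described above.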
Related papers
- Hallucinating AI Hijacking Attack: Large Language Models and Malicious Code Recommenders [0.0]
This research builds and evaluates the adversarial potential to introduce copied code or hallucinated AI recommendations for malicious code in popular code repositories.
Foundational large language models (LLMs) from OpenAI, Google, and Anthropic guard against both harmful behaviors and toxic strings.
We compare this attack to previous work on context-shifting and contrast the attack surface as a novel version of "living off the land" attacks in the malware literature.
arXiv Detail & Related papers (2024-10-09T01:36:25Z)
- TAPI: Towards Target-Specific and Adversarial Prompt Injection against Code LLMs [27.700010465702842]
This paper proposes a new attack paradigm, i.e., target-specific and adversarial prompt injection (TAPI) against Code LLMs.
TAPI generates unreadable comments containing information about malicious instructions and hides them as triggers in the external source code.
We successfully attack several famous deployed code-completion applications, including CodeGeeX and GitHub Copilot.
arXiv Detail & Related papers (2024-07-12T10:59:32Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals that, in this practical scenario, backdoor attacks can remain effective even after defenses are applied.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters (a minimal sketch of such a filter follows this entry).
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)
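A minimal sketch of the perplexity-based filtering referenced in this entry is shown below; the GPT-2 scoring model and the threshold are illustrative assumptions, not AutoDAN's setup.

```python
# Minimal perplexity-filter sketch: flag prompts the scoring model finds implausible.
# Assumptions: GPT-2 as the scoring model and a hand-picked threshold.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Per-token perplexity of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood
    return float(torch.exp(loss))

def looks_like_gibberish(prompt: str, threshold: float = 1000.0) -> bool:
    """Reject prompts whose perplexity exceeds the (tunable) threshold."""
    return perplexity(prompt) > threshold

print(looks_like_gibberish("Please summarize the following paragraph."))  # typically low perplexity
print(looks_like_gibberish("zx ]] qv unlock (( latent !! suffix"))        # typically high perplexity
```

Gradient-crafted gibberish suffixes tend to score far higher perplexity than natural text, which is presumably why interpretable attacks such as AutoDAN are harder to catch with this kind of filter.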
- PRAT: PRofiling Adversarial aTtacks [52.693011665938734]
We introduce a novel problem of PRofiling Adversarial aTtacks (PRAT).
Given an adversarial example, the objective of PRAT is to identify the attack used to generate it.
We use AID to devise a novel framework for the PRAT objective.
arXiv Detail & Related papers (2023-09-20T07:42:51Z)
- Contributor-Aware Defenses Against Adversarial Backdoor Attacks [2.830541450812474]
Adversarial backdoor attacks have demonstrated the capability to perform targeted misclassification of specific examples.
We propose a contributor-aware universal defensive framework for learning in the presence of multiple, potentially adversarial data sources.
Our empirical studies demonstrate the robustness of the proposed framework against adversarial backdoor attacks from multiple simultaneous adversaries.
arXiv Detail & Related papers (2022-05-28T20:25:34Z)
- ROOM: Adversarial Machine Learning Attacks Under Real-Time Constraints [3.042299765078767]
This paper introduces a new problem: how do we generate adversarial noise under real-time constraints to support real-time adversarial attacks?
We show how an offline component serves to warm up the online algorithm, making it possible to generate highly successful attacks under time constraints (a generic sketch of this offline/online split follows this entry).
arXiv Detail & Related papers (2022-01-05T14:03:26Z)
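The offline-warm-up/online-refinement split mentioned in this entry can be illustrated generically. The sketch below is not ROOM's algorithm; it only shows the pattern of precomputing a perturbation offline and refining it per input under a wall-clock budget, with a made-up toy objective.

```python
# Generic offline/online split for time-constrained attacks (illustrative only).
import time
import numpy as np

rng = np.random.default_rng(0)

def toy_loss(x: np.ndarray, delta: np.ndarray) -> float:
    """Stand-in for the attacker's objective (higher = more adversarial)."""
    return float(np.sum(np.sin(x + delta)))

def offline_warmup(samples: np.ndarray, steps: int = 500, eps: float = 0.1) -> np.ndarray:
    """Offline: search for a universal perturbation that works well on average."""
    delta = np.zeros(samples.shape[1])
    for _ in range(steps):
        candidate = np.clip(delta + 0.01 * rng.standard_normal(delta.shape), -eps, eps)
        if np.mean([toy_loss(s, candidate) for s in samples]) > np.mean([toy_loss(s, delta) for s in samples]):
            delta = candidate
    return delta

def online_attack(x: np.ndarray, warm_delta: np.ndarray, budget_s: float = 0.01, eps: float = 0.1) -> np.ndarray:
    """Online: refine the warm-started perturbation for this input until the deadline."""
    delta, deadline = warm_delta.copy(), time.monotonic() + budget_s
    while time.monotonic() < deadline:
        candidate = np.clip(delta + 0.01 * rng.standard_normal(delta.shape), -eps, eps)
        if toy_loss(x, candidate) > toy_loss(x, delta):
            delta = candidate
    return delta

warm = offline_warmup(rng.standard_normal((32, 8)))
print(online_attack(rng.standard_normal(8), warm))
```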
- Towards Defending against Adversarial Examples via Attack-Invariant Features [147.85346057241605]
Deep neural networks (DNNs) are vulnerable to adversarial noise.
Adversarial robustness can be improved by exploiting adversarial examples, i.e., by adversarial training (see the sketch after this entry).
Models trained on seen types of adversarial examples generally cannot generalize well to unseen types of adversarial examples.
arXiv Detail & Related papers (2021-06-09T12:49:54Z)
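The claim that robustness can be improved by exploiting adversarial examples refers to adversarial training. The PyTorch sketch below shows the standard FGSM-based recipe, not the attack-invariant-feature method proposed in this paper; the model, data, and epsilon are toy placeholders.

```python
# Standard FGSM adversarial training step (generic recipe, not this paper's method).
import torch
import torch.nn as nn

def fgsm_perturb(model, x, y, eps=0.03):
    """Craft an adversarial example with one signed-gradient step on the input."""
    x_adv = x.clone().detach().requires_grad_(True)
    nn.functional.cross_entropy(model(x_adv), y).backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    """Train on adversarial examples instead of clean ones ('exploiting adversarial examples')."""
    model.train()
    x_adv = fgsm_perturb(model, x, y, eps)
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on random data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
print(adversarial_training_step(model, optimizer, x, y))
```

As the summary notes, a model trained this way tends to overfit to the attack types seen during training, which is what motivates learning attack-invariant features instead.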
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of the training data (a toy illustration of object-level poisoning follows this entry).
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
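A toy illustration of object-level (rather than image-level) poisoning is sketched below; the trigger patch, class ids, and poison rate are made-up values, not this paper's configuration.

```python
# Toy object-level poisoning sketch for segmentation data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
SOURCE_CLASS, TARGET_CLASS, POISON_RATE = 1, 2, 0.05  # assumed values

def poison_sample(image: np.ndarray, mask: np.ndarray):
    """Stamp a small trigger patch and relabel only the source object's pixels."""
    image, mask = image.copy(), mask.copy()
    image[:4, :4] = 1.0                        # trigger patch in the corner
    mask[mask == SOURCE_CLASS] = TARGET_CLASS  # object-level label flip, image-level labels untouched
    return image, mask

def poison_dataset(images, masks):
    """Poison only a small fraction of samples that contain the source object."""
    for i in rng.choice(len(images), size=int(POISON_RATE * len(images)), replace=False):
        if (masks[i] == SOURCE_CLASS).any():
            images[i], masks[i] = poison_sample(images[i], masks[i])
    return images, masks

# Toy usage: 100 random 32x32 grayscale images with 3-class masks.
images = [rng.random((32, 32)) for _ in range(100)]
masks = [rng.integers(0, 3, size=(32, 32)) for _ in range(100)]
poison_dataset(images, masks)
```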