Semantic-Preserving Adversarial Code Comprehension
- URL: http://arxiv.org/abs/2209.05130v1
- Date: Mon, 12 Sep 2022 10:32:51 GMT
- Title: Semantic-Preserving Adversarial Code Comprehension
- Authors: Yiyang Li, Hongqiu Wu, Hai Zhao
- Abstract summary: We propose Semantic-Preserving Adversarial Code Embeddings (SPACE) to find the worst-case semantic-preserving attacks.
Experiments and analysis demonstrate that SPACE can stay robust against state-of-the-art attacks while boosting the performance of PrLMs for code.
- Score: 75.76118224437974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Based on the tremendous success of pre-trained language models (PrLMs) for source code comprehension tasks, current literature studies either how to further improve the performance (generalization) of PrLMs or how to improve their robustness against adversarial attacks. However, these studies have to compromise on the trade-off between the two aspects, and none of them considers improving both sides in an effective and practical way. To fill this gap, we propose Semantic-Preserving Adversarial Code Embeddings (SPACE) to find the worst-case semantic-preserving attacks while forcing the model to predict the correct labels under these worst cases. Experiments and analysis demonstrate that SPACE can stay robust against state-of-the-art attacks while boosting the performance of PrLMs for code.
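The abstract describes SPACE as finding worst-case semantic-preserving perturbations in embedding space and then training the model to stay correct under them. Below is a minimal, illustrative PyTorch sketch of that general idea (gradient ascent on a perturbation restricted to identifier positions, whose renaming preserves code semantics, followed by a descent step on the perturbed loss). It assumes a HuggingFace-style classifier and a precomputed identifier mask; all helper names and hyperparameters are assumptions for illustration, not the paper's actual implementation.

    # Illustrative sketch of embedding-space adversarial training in the spirit of SPACE.
    # Assumed inputs: a HuggingFace-style model, token ids, an attention mask, labels,
    # and identifier_mask (B, L) marking positions whose perturbation preserves semantics.
    import torch

    def space_style_step(model, input_ids, attention_mask, identifier_mask, labels,
                         epsilon=1.0, alpha=0.3, ascent_steps=3):
        """One adversarial training step on a batch (sketch, not the official code)."""
        embeds = model.get_input_embeddings()(input_ids)       # (B, L, H) token embeddings
        delta = torch.zeros_like(embeds, requires_grad=True)   # perturbation, init at zero
        mask = identifier_mask.unsqueeze(-1).float()           # restrict to identifier tokens

        # Inner loop: gradient ascent on the loss w.r.t. the perturbation only.
        for _ in range(ascent_steps):
            out = model(inputs_embeds=embeds + delta,
                        attention_mask=attention_mask, labels=labels)
            grad, = torch.autograd.grad(out.loss, delta)
            with torch.no_grad():
                grad = grad * mask                              # keep non-identifier positions clean
                delta = delta + alpha * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
                delta = delta.clamp(-epsilon, epsilon)          # bound the perturbation
            delta.requires_grad_(True)

        # Outer step: minimize the loss under the worst-case perturbation found above,
        # combined with the clean loss so generalization is not sacrificed.
        adv_loss = model(inputs_embeds=embeds + delta.detach(),
                         attention_mask=attention_mask, labels=labels).loss
        clean_loss = model(input_ids=input_ids,
                           attention_mask=attention_mask, labels=labels).loss
        return clean_loss + adv_loss

The returned loss would be backpropagated and optimized as usual; the clean/adversarial weighting and the norm constraint are design choices left open in this sketch.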
Related papers
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts.
Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
- AdvAnchor: Enhancing Diffusion Model Unlearning with Adversarial Anchors [61.007590285263376]
Security concerns have driven researchers to unlearn inappropriate concepts through fine-tuning.
Recent fine-tuning methods exhibit a considerable performance trade-off between eliminating undesirable concepts and preserving other concepts.
We propose AdvAnchor, a novel approach that generates adversarial anchors to alleviate the trade-off issue.
arXiv Detail & Related papers (2024-12-28T04:44:07Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Transfer Attacks and Defenses for Large Language Models on Coding Tasks [30.065641782962974]
We study the effect of adversarial perturbations on coding tasks with large language models (LLMs).
We propose prompt-based defenses that involve modifying the prompt to include examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations.
Our experiments show that adversarial examples obtained with a smaller code model are indeed transferable, weakening the LLMs' performance.
arXiv Detail & Related papers (2023-11-22T15:11:35Z)
- Fooling the Textual Fooler via Randomizing Latent Representations [13.77424820701913]
Adversarial word-level perturbations are well-studied and effective attack strategies.
We propose AdvFooler, a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example.
We empirically demonstrate near state-of-the-art robustness of AdvFooler against representative adversarial word-level attacks.
arXiv Detail & Related papers (2023-10-02T06:57:25Z)
- Effective Targeted Attacks for Adversarial Self-Supervised Learning [58.14233572578723]
Unsupervised adversarial training (AT) has been highlighted as a means of achieving robustness in models without any label information.
We propose a novel positive mining for targeted adversarial attack to generate effective adversaries for adversarial SSL frameworks.
Our method demonstrates significant enhancements in robustness when applied to non-contrastive SSL frameworks, and less but consistent robustness improvements with contrastive SSL frameworks.
arXiv Detail & Related papers (2022-10-19T11:43:39Z)
- Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples [0.0]
We highlight a major security vulnerability in the public release of GPT-3 and investigate this vulnerability in other state-of-the-art PLMs.
We underscore token distance-minimized perturbations as an effective adversarial approach, bypassing both supervised and unsupervised quality measures.
arXiv Detail & Related papers (2022-09-05T20:29:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.