Unitary Multi-Margin BERT for Robust Natural Language Processing
- URL: http://arxiv.org/abs/2410.12759v1
- Date: Wed, 16 Oct 2024 17:30:58 GMT
- Title: Unitary Multi-Margin BERT for Robust Natural Language Processing
- Authors: Hao-Yuan Chang, Kang L. Wang
- Abstract summary: Recent developments in adversarial attacks on deep learning leave many mission-critical natural language processing (NLP) systems at risk of exploitation.
To address the lack of computationally efficient adversarial defense methods, this paper reports a novel, universal technique that drastically improves the robustness of Bidirectional Encoder Representations from Transformers (BERT) by combining unitary weights with the multi-margin loss.
Our model, the unitary multi-margin BERT (UniBERT), boosts post-attack classification accuracies significantly by 5.3% to 73.8% while maintaining competitive pre-attack accuracies.
- Abstract: Recent developments in adversarial attacks on deep learning leave many mission-critical natural language processing (NLP) systems at risk of exploitation. To address the lack of computationally efficient adversarial defense methods, this paper reports a novel, universal technique that drastically improves the robustness of Bidirectional Encoder Representations from Transformers (BERT) by combining the unitary weights with the multi-margin loss. We discover that the marriage of these two simple ideas amplifies the protection against malicious interference. Our model, the unitary multi-margin BERT (UniBERT), boosts post-attack classification accuracies significantly by 5.3% to 73.8% while maintaining competitive pre-attack accuracies. Furthermore, the pre-attack and post-attack accuracy tradeoff can be adjusted via a single scalar parameter to best fit the design requirements for the target applications.
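The abstract names the two ingredients but gives no implementation details; the following is a minimal, hypothetical PyTorch sketch of how they could be combined: an orthogonally parametrized classification head (a real-valued stand-in for unitary weights) trained with a multi-class margin loss whose margin acts as the single scalar trade-off parameter mentioned above. All class names and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): orthogonal projection head + multi-margin loss.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class MarginHead(nn.Module):
    def __init__(self, hidden_dim: int, num_classes: int, margin: float = 1.0):
        super().__init__()
        # Keep the projection rows orthonormal (W @ W.T = I), a real-valued
        # stand-in for the paper's unitary weight constraint (assumption).
        self.proj = orthogonal(nn.Linear(hidden_dim, num_classes, bias=False))
        # nn.MultiMarginLoss penalizes max(0, margin - x[y] + x[i]) per class;
        # the margin is assumed to play the role of the scalar trade-off knob.
        self.criterion = nn.MultiMarginLoss(margin=margin)

    def forward(self, pooled: torch.Tensor, labels: torch.Tensor):
        logits = self.proj(pooled)
        return logits, self.criterion(logits, labels)

# Usage with a pooled [CLS] representation from any BERT encoder:
head = MarginHead(hidden_dim=768, num_classes=2, margin=2.0)
pooled = torch.randn(4, 768)          # stand-in for BERT's pooled output
labels = torch.tensor([0, 1, 1, 0])
logits, loss = head(pooled, labels)
loss.backward()
```

A larger margin pushes class logits further apart at the cost of clean accuracy, which is one plausible reading of the pre-attack/post-attack trade-off described in the abstract.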
Related papers
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs.
In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging techniques inspired by transfer-based attacks.
We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z) - Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z) - Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks [62.036798488144306]
Current defenses mainly focus on known attacks, while adversarial robustness to unknown attacks is seriously overlooked.
We propose an attack-agnostic defense method named Meta Invariance Defense (MID).
We show that MID simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration.
arXiv Detail & Related papers (2024-04-04T10:10:38Z) - Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge [17.3048898399324]
The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies.
However, it also exposes models to backdoor attacks, in which hidden malicious behaviors are triggered by specific inputs, compromising the integrity and reliability of natural language processing (NLP) systems.
This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities.
arXiv Detail & Related papers (2024-02-29T16:37:08Z) - Parameter-Saving Adversarial Training: Reinforcing Multi-Perturbation Robustness via Hypernetworks [47.21491911505409]
Adversarial training serves as one of the most popular and effective methods to defend against adversarial perturbations.
We propose a novel multi-perturbation adversarial training framework, parameter-saving adversarial training (PSAT), to reinforce multi-perturbation robustness.
arXiv Detail & Related papers (2023-09-28T07:16:02Z) - Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z) - Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications.
We propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths.
Our method is trained to automatically align features of arbitrary attacking strength.
arXiv Detail & Related papers (2021-05-31T17:01:05Z) - A Data Augmentation-based Defense Method Against Adversarial Attacks in Neural Networks [7.943024117353317]
We develop a lightweight defense method that can efficiently invalidate full whitebox adversarial attacks while remaining compatible with real-life constraints.
Our model can withstand an advanced adaptive attack, namely BPDA with 50 rounds, helping the target model maintain an accuracy of around 80% while constraining the attack success rate to almost zero.
arXiv Detail & Related papers (2020-07-30T08:06:53Z) - Robust Encodings: A Framework for Combating Adversarial Typos [85.70270979772388]
NLP systems are easily fooled by small perturbations of inputs.
Existing procedures to defend against such perturbations are either heuristic and susceptible to stronger attacks, or provide guaranteed worst-case robustness at the cost of compatibility with state-of-the-art models.
We introduce robust encodings (RobEn) that confer guaranteed robustness without making compromises on model architecture.
arXiv Detail & Related papers (2020-05-04T01:28:18Z) - BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks on discrete data (such as text) are more challenging than on continuous data (such as images).
We propose BERT-Attack, a high-quality and effective method to generate adversarial samples (a minimal sketch of its masked-LM substitution step follows this entry).
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
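For context, the core substitution step behind BERT-Attack can be illustrated with a short, hypothetical sketch using the Hugging Face transformers library: mask a word in the victim sentence and let a masked language model propose in-context replacements, which are then tried as adversarial candidates. This is an assumption-laden illustration of the general idea, not the authors' released code.

```python
# Minimal sketch of masked-LM substitution (not the BERT-Attack implementation):
# mask one word and let BERT propose plausible in-context replacements.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def candidate_substitutes(sentence: str, target_word: str, top_k: int = 8):
    """Return the top-k replacements BERT predicts for `target_word` in context."""
    masked = sentence.replace(target_word, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos].topk(top_k, dim=-1).indices[0]
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())

# Each candidate would then be scored against the victim classifier, keeping the
# substitution that flips the prediction with the smallest perturbation.
print(candidate_substitutes("the movie was surprisingly good", "good"))
```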