Improving the Adversarial Robustness of NLP Models by Information
Bottleneck
- URL: http://arxiv.org/abs/2206.05511v1
- Date: Sat, 11 Jun 2022 12:12:20 GMT
- Title: Improving the Adversarial Robustness of NLP Models by Information
Bottleneck
- Authors: Cenyuan Zhang, Xiang Zhou, Yixin Wan, Xiaoqing Zheng, Kai-Wei Chang,
Cho-Jui Hsieh
- Abstract summary: Non-robust features can be easily manipulated by adversaries to fool NLP models.
In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory.
We show that the models trained with our information bottleneck-based method are able to achieve a significant improvement in robust accuracy.
- Score: 112.44039792098579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing studies have demonstrated that adversarial examples can be directly
attributed to the presence of non-robust features, which are highly predictive,
but can be easily manipulated by adversaries to fool NLP models. In this study,
we explore the feasibility of capturing task-specific robust features, while
eliminating the non-robust ones by using the information bottleneck theory.
Through extensive experiments, we show that the models trained with our
information bottleneck-based method are able to achieve a significant
improvement in robust accuracy, exceeding the performance of all previously
reported defense methods while suffering almost no drop in clean accuracy on
the SST-2, AGNEWS, and IMDB datasets.
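The abstract does not spell out the training objective, but the general recipe can be illustrated with a standard variational information bottleneck (VIB) head placed on top of a text encoder: a stochastic bottleneck whose KL penalty squeezes out input-specific (non-robust) information while the task loss preserves label-relevant features. The sketch below is a generic VIB layer under assumed dimensions and an assumed trade-off weight `beta`; it is not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBClassifier(nn.Module):
    """Variational information bottleneck head on top of sentence features.

    Minimizes  CE(y_hat, y) + beta * KL( q(z|x) || N(0, I) ),
    compressing away input-specific (non-robust) information while
    keeping what is predictive of the label.
    """

    def __init__(self, feat_dim=768, bottleneck_dim=128, num_classes=2, beta=1e-3):
        super().__init__()
        self.mu = nn.Linear(feat_dim, bottleneck_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(feat_dim, bottleneck_dim)   # log-variance of q(z|x)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)
        self.beta = beta

    def forward(self, features, labels=None):
        mu, logvar = self.mu(features), self.logvar(features)
        # Reparameterization trick: sample z ~ q(z|x) differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        logits = self.classifier(z)
        if labels is None:
            return logits
        ce = F.cross_entropy(logits, labels)
        # KL(q(z|x) || N(0, I)), averaged over the batch.
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
        return ce + self.beta * kl, logits
```

In practice, `features` would be the pooled output of a pretrained encoder (e.g., BERT's [CLS] vector), and `beta` controls how aggressively input information is compressed away, trading clean accuracy against robustness.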
Related papers
- Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been believed to be a property that is challenging to encode into neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
- Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
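For context on what low-rank fine-tuning refers to here, a minimal LoRA-style adapter is sketched below (rank and scaling are assumed values, not the paper's settings): the pretrained weight stays frozen and only a rank-r update B·A is trained, which is also why behaviors encoded in the frozen weights tend to be preserved.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update: W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the learned low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```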
arXiv Detail & Related papers (2024-05-28T20:43:53Z)
- Towards Robust and Accurate Visual Prompting [11.918195429308035]
We study whether a visual prompt derived from a robust model can inherit its robustness while suffering a decline in generalization performance.
We introduce a novel technique named Prompt Boundary Loose (PBL) to effectively mitigate the drop in standard accuracy suffered by visual prompts.
Our findings are universal and demonstrate the significant benefits of our proposed method.
arXiv Detail & Related papers (2023-11-18T07:00:56Z)
- Towards a robust and reliable deep learning approach for detection of compact binary mergers in gravitational wave data [0.0]
We develop a deep learning model stage-wise and work towards improving its robustness and reliability.
We retrain the model in a novel framework involving a generative adversarial network (GAN).
Although absolute robustness is practically impossible to achieve, we demonstrate some fundamental improvements earned through such training.
arXiv Detail & Related papers (2023-06-20T18:00:05Z)
- Augmenting NLP data to counter Annotation Artifacts for NLI Tasks [0.0]
Large pre-trained NLP models achieve high performance on benchmark datasets but do not actually "solve" the underlying task.
We explore this phenomenon by first using contrast and adversarial examples to understand limitations to the model's performance.
We then propose a data augmentation technique to fix this bias and measure its effectiveness.
arXiv Detail & Related papers (2023-02-09T15:34:53Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
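A minimal sketch of that idea, following the high-level description above rather than the paper's exact implementation (the score function and calibration details are assumptions): pick the threshold on labeled source data so that the fraction of source examples above it matches the source accuracy, then report the fraction of unlabeled target examples above it.

```python
import numpy as np

def atc_estimate(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence (sketch).

    source_conf:    max softmax confidence on labeled source examples
    source_correct: 1/0 array indicating whether each source prediction was correct
    target_conf:    max softmax confidence on unlabeled target examples
    """
    # Choose t so that the source fraction above t matches source accuracy.
    source_acc = source_correct.mean()
    t = np.quantile(source_conf, 1.0 - source_acc)
    # Predicted target accuracy: fraction of target examples above the threshold.
    return (target_conf > t).mean()
```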
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
- Fast and Three-rious: Speeding Up Weak Supervision with Triplet Methods [24.190587751595455]
Weak supervision is a popular method for building machine learning models without relying on ground truth annotations.
Existing approaches use latent variable estimation to model the noisy sources.
We show that for a class of latent variable models highly applicable to weak supervision, we can find a closed-form solution for the model parameters.
We use this insight to build FlyingSquid, a weak supervision framework that runs orders of magnitude faster than previous weak supervision approaches.
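The closed-form solution can be illustrated for the simplest case of three conditionally independent binary labeling functions; the sketch below shows the classic triplet identity under that assumption rather than FlyingSquid's full implementation.

```python
import numpy as np

def triplet_accuracies(L):
    """Closed-form source accuracies from an (n_examples, 3) matrix of {-1, +1} votes.

    Assumes the three labeling functions are conditionally independent given the
    true label Y in {-1, +1}. Then E[l_i * l_j] = E[l_i * Y] * E[l_j * Y], so each
    accuracy moment a_i = E[l_i * Y] is recoverable from pairwise agreement rates.
    """
    m01 = np.mean(L[:, 0] * L[:, 1])
    m02 = np.mean(L[:, 0] * L[:, 2])
    m12 = np.mean(L[:, 1] * L[:, 2])
    a0 = np.sqrt(np.abs(m01 * m02 / m12))
    a1 = np.sqrt(np.abs(m01 * m12 / m02))
    a2 = np.sqrt(np.abs(m02 * m12 / m01))
    # Convert a_i = E[l_i * Y] to P(l_i == Y) = (1 + a_i) / 2.
    return (1 + np.array([a0, a1, a2])) / 2
```

No iterative latent-variable estimation is needed, which is the source of the speedup described above.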
arXiv Detail & Related papers (2020-02-27T07:51:50Z)