Unifying Model Explainability and Robustness for Joint Text
Classification and Rationale Extraction
- URL: http://arxiv.org/abs/2112.10424v1
- Date: Mon, 20 Dec 2021 09:48:32 GMT
- Title: Unifying Model Explainability and Robustness for Joint Text
Classification and Rationale Extraction
- Authors: Dongfang Li, Baotian Hu, Qingcai Chen, Tujie Xu, Jingcong Tao, Yunan
Zhang
- Abstract summary: We propose a joint classification and rationale extraction model named AT-BMC.
It includes two key mechanisms: mixed Adversarial Training (AT), which applies various perturbations in the discrete and embedding spaces to improve the model's robustness, and a Boundary Match Constraint (BMC), which helps locate rationales more precisely with the guidance of boundary information.
Results on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin.
- Score: 11.878012909876713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that explainability and robustness are two crucial
ingredients of trustworthy and reliable text classification. However, previous
works usually address only one of these two aspects: i) how to extract accurate
rationales for explainability while remaining beneficial to prediction; ii) how to
make the predictive model robust to different types of adversarial attacks.
Intuitively, a model that produces helpful explanations should also be more robust
against adversarial attacks, because we cannot trust a model that outputs
explanations yet changes its prediction under small perturbations. To this end,
we propose a joint classification and rationale extraction model named AT-BMC.
It includes two key mechanisms: mixed Adversarial Training (AT), which applies
various perturbations in the discrete and embedding spaces to improve the
model's robustness, and a Boundary Match Constraint (BMC), which helps locate
rationales more precisely with the guidance of boundary information.
Results on benchmark datasets demonstrate that the proposed AT-BMC
outperforms baselines on both classification and rationale extraction by a
large margin. Robustness analysis shows that AT-BMC effectively decreases the
attack success rate by up to 69%. The empirical results indicate
that there are connections between robust models and better explanations.
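The abstract attributes AT-BMC's robustness gains partly to adversarial training that perturbs the input in both the discrete (word) space and the continuous embedding space. As a rough illustration of the embedding-space half only, here is a minimal PyTorch sketch of one training step that adds an FGSM-style perturbation to the token embeddings; the toy model, batch, and epsilon value are assumptions made for illustration and do not reflect the authors' actual architecture, loss weighting, discrete perturbations, or boundary match constraint.

```python
# Minimal sketch (not the authors' code): one step of embedding-space
# adversarial training for a toy text classifier. All sizes are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len, embed_dim, num_classes = 100, 8, 16, 2

embedding = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * embed_dim, num_classes))
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(classifier.parameters()), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (4, seq_len))   # toy batch of token ids
labels = torch.randint(0, num_classes, (4,))

# 1) Clean forward pass; keep the embedding output so its gradient is available.
embeds = embedding(tokens)
embeds.retain_grad()
clean_loss = loss_fn(classifier(embeds), labels)
clean_loss.backward(retain_graph=True)

# 2) FGSM-style perturbation along the gradient sign in embedding space
#    (the continuous counterpart of discrete word-level perturbations).
epsilon = 0.1
delta = epsilon * embeds.grad.detach().sign()

# 3) Adversarial forward pass on the perturbed embeddings, then update the
#    model on the sum of the clean and adversarial losses.
adv_loss = loss_fn(classifier(embeds.detach() + delta), labels)
optimizer.zero_grad()
(clean_loss + adv_loss).backward()
optimizer.step()
```

In a full AT-BMC-style model, the rationale-extraction objective and the boundary match constraint would contribute additional terms to the same total loss; those components are not sketched here.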
Related papers
- A Curious Case of Searching for the Correlation between Training Data and Adversarial Robustness of Transformer Textual Models [11.938237087895649]
Existing works have shown that fine-tuned textual transformer models achieve state-of-the-art prediction performance but are also vulnerable to adversarial text perturbations.
In this paper, we aim to show that there is also a strong correlation between training data and model robustness.
We extract 13 different features representing a wide range of input fine-tuning corpora properties and use them to predict the adversarial robustness of the fine-tuned models.
arXiv Detail & Related papers (2024-02-18T05:58:25Z)
- Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off [107.35833747750446]
Adversarial examples can be crafted by adding imperceptible perturbations to legitimate documents.
This vulnerability raises significant concerns about their reliability and hinders the widespread deployment of NRMs.
In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs.
arXiv Detail & Related papers (2023-12-16T05:38:39Z)
- On the Trade-offs between Adversarial Robustness and Actionable Explanations [32.05150063480917]
We make one of the first attempts at studying the impact of adversarially robust models on actionable explanations.
We derive theoretical bounds on the differences between the cost and the validity of recourses generated by state-of-the-art algorithms.
Our results show that adversarially robust models significantly increase the cost and reduce the validity of the resulting recourses.
arXiv Detail & Related papers (2023-09-28T13:59:50Z)
- Doubly Robust Instance-Reweighted Adversarial Training [107.40683655362285]
We propose a novel doubly-robust instance reweighted adversarial framework.
Our importance weights are obtained by optimizing the KL-divergence regularized loss function.
Our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance.
arXiv Detail & Related papers (2023-08-01T06:16:18Z)
- Feature Separation and Recalibration for Adversarial Robustness [18.975320671203132]
We propose a novel, easy-to-verify approach named Feature Separation and Recalibration.
It recalibrates the malicious, non-robust activations for more robust feature maps through Separation and Recalibration.
It improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead.
arXiv Detail & Related papers (2023-03-24T07:43:57Z)
- Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), an adversarial-attack-based method that produces semantic adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z)
- Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance.
By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty.
arXiv Detail & Related papers (2022-02-21T10:36:09Z)
- Clustering Effect of (Linearized) Adversarial Robust Models [60.25668525218051]
We propose a novel understanding of adversarial robustness and apply it to more tasks, including domain adaptation and robustness boosting.
Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.
arXiv Detail & Related papers (2021-11-25T05:51:03Z)
- Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)