An Adversarially-Learned Turing Test for Dialog Generation Models
- URL: http://arxiv.org/abs/2104.08231v1
- Date: Fri, 16 Apr 2021 17:13:14 GMT
- Title: An Adversarially-Learned Turing Test for Dialog Generation Models
- Authors: Xiang Gao, Yizhe Zhang, Michel Galley, Bill Dolan
- Abstract summary: We propose an adversarial training approach to learn a robust model, ATT, that discriminates machine-generated responses from human-written replies.
In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples.
Our discriminator shows high accuracy on strong attackers including DialoGPT and GPT-3.
- Score: 45.991035017908594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The design of better automated dialogue evaluation metrics offers
the potential to accelerate evaluation research on conversational AI. However,
existing trainable dialogue evaluation models are generally restricted to
classifiers trained in a purely supervised manner, which are exposed to a
significant risk of adversarial attack (e.g., a nonsensical response that
nonetheless receives a high classification score). To alleviate this risk, we propose an adversarial
training approach to learn a robust model, ATT (Adversarial Turing Test), that
discriminates machine-generated responses from human-written replies. In
contrast to previous perturbation-based methods, our discriminator is trained
by iteratively generating unrestricted and diverse adversarial examples using
reinforcement learning. The key benefit of this unrestricted adversarial
training approach is that it allows the discriminator to improve its robustness
through an iterative attack-defense game. Our discriminator shows high accuracy on strong
attackers including DialoGPT and GPT-3.
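At its core the method is an iterative attack-defense game: an RL-trained attacker proposes unrestricted adversarial responses, and the discriminator is retrained to reject them. The sketch below illustrates that loop with toy stand-in models (a bag-of-words discriminator and an unconditional token sampler updated by REINFORCE); every class, size, and hyperparameter is our own illustrative choice, not the authors' implementation, in which the attacker is a full dialogue response generator rewarded by the evolving discriminator.

```python
# Minimal sketch of the iterative attack-defense game; toy stand-ins only.
import torch
import torch.nn.functional as F

VOCAB, DIM, LEN = 1000, 64, 12

class Discriminator(torch.nn.Module):
    """Bag-of-words scorer: positive logit means 'human-written'."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(VOCAB, DIM)
        self.out = torch.nn.Linear(DIM, 1)

    def forward(self, tokens):                  # tokens: (batch, LEN) int64
        return self.out(self.emb(tokens)).squeeze(-1)

class Attacker(torch.nn.Module):
    """Toy unconditional generator: independent per-position token logits."""
    def __init__(self):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(LEN, VOCAB))

    def sample(self, batch):
        dist = torch.distributions.Categorical(logits=self.logits)
        tokens = dist.sample((batch,))          # (batch, LEN)
        return tokens, dist.log_prob(tokens).sum(-1)

disc, atk = Discriminator(), Attacker()
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(atk.parameters(), lr=1e-2)

def human_batch(batch):                         # stand-in for human replies
    return torch.randint(0, VOCAB, (batch, LEN))

for _ in range(100):                            # iterative attack-defense game
    # Attack: REINFORCE, reward = discriminator's 'human' probability on fakes.
    fakes, logp = atk.sample(32)
    reward = torch.sigmoid(disc(fakes)).detach()
    g_loss = -(logp * (reward - reward.mean())).mean()  # centered baseline
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    # Defense: retrain the discriminator on human vs. freshly generated fakes.
    fakes, _ = atk.sample(32)
    logits = torch.cat([disc(human_batch(32)), disc(fakes)])
    labels = torch.cat([torch.ones(32), torch.zeros(32)])
    d_loss = F.binary_cross_entropy_with_logits(logits, labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
```

Because the attacker is trained by RL on the discriminator's score rather than by perturbing existing examples, the adversarial examples are unrestricted and diverse, which is the property the abstract highlights.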
Related papers
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Adversarial Robustness of Deep Reinforcement Learning based Dynamic Recommender Systems [50.758281304737444]
We propose to explore adversarial examples and attack detection on reinforcement learning-based interactive recommendation systems.
We first craft different types of adversarial examples by adding perturbations to the input and intervening on the causal factors.
Then, we augment recommendation systems by detecting potential attacks with a deep learning-based classifier based on the crafted data.
arXiv Detail & Related papers (2021-12-02T04:12:24Z)
- Model-Agnostic Meta-Attack: Towards Reliable Evaluation of Adversarial Robustness [53.094682754683255]
We propose a Model-Agnostic Meta-Attack (MAMA) approach to discover stronger attack algorithms automatically.
Our method learns the optimizer in adversarial attacks, parameterized by a recurrent neural network.
We develop a model-agnostic training algorithm to improve the generalization ability of the learned optimizer when attacking unseen defenses.
arXiv Detail & Related papers (2021-10-13T13:54:24Z)
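The learned-optimizer idea in MAMA above can be pictured as a coordinate-wise LSTM that consumes loss gradients and emits perturbation updates. The sketch below is a generic instance under our own assumptions (a toy linear classifier, an L-infinity budget) and omits MAMA's meta-training of the optimizer across data samples and defenses; it is not the paper's algorithm.

```python
# Hypothetical sketch of a learned attack optimizer: an LSTM maps
# per-coordinate gradients to perturbation updates.
import torch
import torch.nn.functional as F

class LearnedAttacker(torch.nn.Module):
    """Coordinate-wise LSTM: one gradient coordinate in, one update out."""
    def __init__(self, hidden=20):
        super().__init__()
        self.cell = torch.nn.LSTMCell(1, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def step(self, grad, state):
        g = grad.reshape(-1, 1)                 # one coordinate per row
        h, c = self.cell(g, state)
        return self.head(h).reshape(grad.shape), (h, c)

def attack(model, x, y, attacker, eps=0.03, steps=10):
    """Craft an L-inf bounded perturbation using the learned optimizer."""
    delta = torch.zeros_like(x, requires_grad=True)
    state = None
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        update, state = attacker.step(grad, state)
        delta = (delta + update).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).detach()

# Toy usage: attack a linear classifier on random 'images'.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
x, y = torch.rand(8, 1, 28, 28), torch.randint(0, 10, (8,))
x_adv = attack(model, x, y, LearnedAttacker())
```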
- Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z)
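The two perspectives of the ASV defense above, purification and detection, amount to a small pipeline: flag likely adversarial inputs, and denoise whatever is passed through. The purifier, detector, and scorer below are placeholder callables of our own devising, not the paper's self-supervised models.

```python
# Illustrative shield around an ASV scorer; components are stand-ins.
import torch

def defend(asv_score, purifier, detector, utterance, threshold=0.5):
    """Return (score_or_None, flagged_as_adversarial)."""
    if detector(utterance) > threshold:      # detection perspective
        return None, True                    # shield the ASV system
    cleaned = purifier(utterance)            # purification perspective
    return asv_score(cleaned), False

# Toy usage with dummy components on a random waveform.
wave = torch.randn(16000)
score, flagged = defend(
    asv_score=lambda w: w.abs().mean().item(),  # dummy verification score
    purifier=lambda w: w.clamp(-1, 1),          # dummy denoiser
    detector=lambda w: float(w.std() > 2.0),    # dummy adversarial detector
    utterance=wave,
)
```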
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
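ENIGMA above estimates how humans would score the target policy from logged conversations alone, with no new human interaction. Its estimator is model-free and more involved; as a simpler, hypothetical illustration of off-policy evaluation in this setting, the sketch below uses ordinary importance sampling, with all function names and data shapes being our own assumptions.

```python
# Generic off-policy evaluation sketch; not ENIGMA's estimator.
import math

def ois_estimate(logged_episodes, target_logprob):
    """Ordinary importance sampling estimate of the target policy's score.

    logged_episodes: list of (turns, behavior_logprob, human_score), where
    turns is a list of (context, response) pairs sampled from the behavior
    policy and human_score is the human evaluation of that episode."""
    total = 0.0
    for turns, behavior_logprob, human_score in logged_episodes:
        target_lp = sum(target_logprob(c, r) for c, r in turns)
        weight = math.exp(target_lp - behavior_logprob)  # p_target / p_behavior
        total += weight * human_score
    return total / len(logged_episodes)

# Toy usage: two logged single-turn episodes, constant target policy.
episodes = [([("hi", "hello")], math.log(0.5), 4.0),
            ([("hi", "hey")], math.log(0.25), 2.0)]
print(ois_estimate(episodes, lambda c, r: math.log(0.4)))  # approx. 3.2
```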
- ATRO: Adversarial Training with a Rejection Option [10.36668157679368]
This paper proposes a classification framework with a rejection option to mitigate the performance deterioration caused by adversarial examples.
By applying the adversarial training objective to a classifier and a rejection function simultaneously, the model can abstain from classification when its confidence on a test data point is insufficient.
arXiv Detail & Related papers (2020-10-24T14:05:03Z)
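The rejection mechanism in ATRO above can be sketched as two heads over a shared input: a class head and a rejection score r(x), abstaining when r(x) falls below zero. This minimal sketch shows only the abstention rule; ATRO's joint adversarial training objective is not reproduced, and all layer sizes are our own choices.

```python
# Minimal classification-with-rejection sketch.
import torch

class RejectingClassifier(torch.nn.Module):
    """Classifier plus rejection head; abstains when r(x) < 0."""
    def __init__(self, dim=20, num_classes=3):
        super().__init__()
        self.cls = torch.nn.Linear(dim, num_classes)  # class scores
        self.rej = torch.nn.Linear(dim, 1)            # rejection score r(x)

    def predict(self, x):
        """Return predicted class index, or -1 to abstain."""
        pred = self.cls(x).argmax(dim=-1)
        r = self.rej(x).squeeze(-1)
        return torch.where(r >= 0, pred, torch.full_like(pred, -1))

# Toy usage: some predictions come back as -1 (rejected).
model = RejectingClassifier()
print(model.predict(torch.randn(5, 20)))
```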
- Class-Aware Domain Adaptation for Improving Adversarial Robustness [27.24720754239852]
Adversarial training has been proposed to train robust networks by injecting adversarial examples into the training data.
We propose a novel Class-Aware Domain Adaptation (CADA) method for adversarial defense without directly applying adversarial training.
arXiv Detail & Related papers (2020-05-10T03:45:19Z)
- EnsembleGAN: Adversarial Learning for Retrieval-Generation Ensemble Model on Short-Text Conversation [37.80290058812499]
EnsembleGAN is an adversarial learning framework for enhancing a retrieval-generation ensemble model in open-domain conversation scenarios.
It consists of a language-model-like generator, a ranker generator, and a ranker discriminator.
arXiv Detail & Related papers (2020-04-30T05:59:12Z)
- Counterfactual Off-Policy Training for Neural Response Generation [94.76649147381232]
We propose to explore potential responses by counterfactual reasoning.
Training on the counterfactual responses under the adversarial learning framework helps to explore the high-reward area of the potential response space.
An empirical study on the DailyDialog dataset shows that our approach significantly outperforms the HRED model.
arXiv Detail & Related papers (2020-04-29T22:46:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.