Healing Unsafe Dialogue Responses with Weak Supervision Signals
- URL: http://arxiv.org/abs/2305.15757v1
- Date: Thu, 25 May 2023 06:15:53 GMT
- Title: Healing Unsafe Dialogue Responses with Weak Supervision Signals
- Authors: Zi Liang, Pinghui Wang, Ruofei Zhang, Shuo Zhang, Xiaofan Ye, Yi Huang,
Junlan Feng
- Abstract summary: An unsupervised pseudo-label sampling method, TEMP, can automatically assign potential safe responses.
TEMP groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy.
Experiments in chitchat and task-oriented dialogues show that TEMP outperforms state-of-the-art models with weak supervision signals.
- Score: 24.749797310489253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen increasing concern about unsafe response
generation in large-scale dialogue systems, where agents learn offensive or
biased behaviors from real-world corpora. Some methods address this issue by
detecting and replacing unsafe training examples in a pipeline style. Though
effective, they suffer from high annotation costs and adapt poorly to unseen
scenarios as well as adversarial attacks. Moreover, neglecting to provide
informative safe responses (e.g. simply replacing them with templates) causes
an information-missing problem in dialogues. To address these issues, we
propose an unsupervised pseudo-label sampling method, TEMP, that can
automatically assign potential safe responses. Specifically, our TEMP method
groups responses into several clusters and samples multiple labels with an
adaptively sharpened sampling strategy, inspired by the observation that
unsafe samples in each cluster are usually few and lie in the tail of the
distribution. Extensive experiments in chitchat and task-oriented dialogues
show that TEMP outperforms state-of-the-art models with weak supervision
signals and obtains comparable results under unsupervised learning settings.
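A minimal sketch of the cluster-then-sample idea described in the abstract, assuming sentence embeddings, a k-means clustering step, and a fixed temperature-style sharpening factor. The function name `sample_pseudo_labels`, the `sharpen` parameter, and the centroid-distance scoring are illustrative assumptions, not the authors' implementation; in the paper the sharpening is adaptive, whereas here it is fixed for brevity.

```python
# Illustrative sketch (not the authors' code): assign pseudo-label responses by
# clustering candidate responses and sampling from the dense head of each
# cluster, assuming unsafe responses are rare and sit in the tail.
import numpy as np
from sklearn.cluster import KMeans


def sample_pseudo_labels(embeddings, responses, n_clusters=8, n_labels=3,
                         sharpen=2.0, seed=0):
    """Cluster responses and sample multiple pseudo-labels per cluster.

    embeddings : (n, d) array of response embeddings (any sentence encoder).
    responses  : list of n response strings aligned with `embeddings`.
    sharpen    : temperature-like factor; larger values concentrate probability
                 mass near the cluster centre and away from the tail.
    """
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    pseudo_labels = {}
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        if len(idx) == 0:
            continue
        # Distance to the centroid: small distance = dense (presumed-safe) head.
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        # Sharpened sampling distribution over the cluster members.
        logits = -sharpen * dists
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        chosen = rng.choice(idx, size=min(n_labels, len(idx)), replace=False, p=probs)
        pseudo_labels[c] = [responses[i] for i in chosen]
    return pseudo_labels
```

Called with embeddings from any off-the-shelf sentence encoder, this returns a mapping from cluster id to a handful of candidate pseudo-label responses that could stand in for flagged unsafe ones.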
Related papers
- Corpus Poisoning via Approximate Greedy Gradient Descent [48.5847914481222]
We propose Approximate Greedy Gradient Descent, a new attack on dense retrieval systems based on the widely used HotFlip method for generating adversarial passages.
We show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains.
arXiv Detail & Related papers (2024-06-07T17:02:35Z)
- Contactless Fingerprint Biometric Anti-Spoofing: An Unsupervised Deep Learning Approach [0.0]
We introduce an innovative anti-spoofing approach that combines an unsupervised autoencoder with a convolutional block attention module.
The scheme has achieved an average BPCER of 0.96% with an APCER of 1.6% for presentation attacks involving various types of spoofed samples.
arXiv Detail & Related papers (2023-11-07T17:19:59Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Confidence-driven Sampling for Backdoor Attacks [49.72680157684523]
Backdoor attacks aim to surreptitiously insert malicious triggers into DNN models, granting unauthorized control during testing scenarios.
Existing methods lack robustness against defense strategies and predominantly focus on enhancing trigger stealthiness while randomly selecting poisoned samples.
We introduce a straightforward yet highly effective sampling methodology that leverages confidence scores. Specifically, it selects samples with lower confidence scores, significantly increasing the challenge for defenders in identifying and countering these attacks.
arXiv Detail & Related papers (2023-10-08T18:57:36Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners.
We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
- Identifying Adversarially Attackable and Robust Samples [1.4213973379473654]
Adversarial attacks insert small, imperceptible perturbations into input samples that cause large, undesired changes to the output of deep learning models.
This work introduces the notion of sample attackability, where we aim to identify samples that are most susceptible to adversarial attacks.
We propose a deep-learning-based detector to identify the adversarially attackable and robust samples in an unseen dataset for an unseen target model.
arXiv Detail & Related papers (2023-01-30T13:58:14Z)
- Generalizable Black-Box Adversarial Attack with Meta Learning [54.196613395045595]
In a black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful perturbation based on query feedback under a query budget.
We propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability.
The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance.
arXiv Detail & Related papers (2023-01-01T07:24:12Z)
- Adversarially Robust One-class Novelty Detection [83.1570537254877]
We show that existing novelty detectors are susceptible to adversarial examples.
We propose a defense strategy that manipulates the latent space of novelty detectors to improve the robustness against adversarial examples.
arXiv Detail & Related papers (2021-08-25T10:41:29Z)
- An Adversarially-Learned Turing Test for Dialog Generation Models [45.991035017908594]
We propose an adversarial training approach to learn a robust model, ATT, that discriminates machine-generated responses from human-written replies.
In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples.
Our discriminator shows high accuracy on strong attackers including DialoGPT and GPT-3.
arXiv Detail & Related papers (2021-04-16T17:13:14Z)
- Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification [78.51092318750102]
This work proposes to defend ASV systems against adversarial attacks with a separate detection network.
A VGG-like binary classification detector is introduced and demonstrated to be effective on detecting adversarial samples.
arXiv Detail & Related papers (2020-06-11T04:31:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.