Healing Unsafe Dialogue Responses with Weak Supervision Signals
- URL: http://arxiv.org/abs/2305.15757v1
- Date: Thu, 25 May 2023 06:15:53 GMT
- Title: Healing Unsafe Dialogue Responses with Weak Supervision Signals
- Authors: Zi Liang, Pinghui Wang, Ruofei Zhang, Shuo Zhang, Xiaofan Ye, Yi Huang, Junlan Feng
- Abstract summary: An unsupervised pseudo-label sampling method, TEMP, can automatically assign potential safe responses.
Our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy.
Experiments in chitchat and task-oriented dialogues show that TEMP outperforms state-of-the-art models with weak supervision signals.
- Score: 24.749797310489253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen increasing concern about unsafe response
generation in large-scale dialogue systems, where agents learn offensive or
biased behaviors from real-world corpora. Several methods address this issue by
detecting and replacing unsafe training examples in a pipeline style. Though
effective, they suffer from high annotation costs and adapt poorly to unseen
scenarios and adversarial attacks. Moreover, neglecting to provide informative
safe responses (e.g., simply replacing them with templates) causes an
information-missing problem in dialogues. To address these issues, we propose
TEMP, an unsupervised pseudo-label sampling method that automatically assigns
potential safe responses. Specifically, TEMP groups responses into several
clusters and samples multiple labels with an adaptively sharpened sampling
strategy, inspired by the observation that unsafe samples within a cluster are
usually few and lie in the tail of the distribution. Extensive experiments on
chitchat and task-oriented dialogues show that TEMP outperforms
state-of-the-art models with weak supervision signals and obtains comparable
results under unsupervised learning settings.
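The cluster-then-sample idea described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical reconstruction rather than the authors' released implementation: it assumes candidate responses have already been encoded into vectors, clusters them with k-means, and then draws pseudo-labels from each cluster using a temperature-sharpened distribution that favors members near the cluster core and down-weights the low-density tail where unsafe responses tend to sit. The function name, the choice of k-means, and the fixed sharpening temperature (the paper describes an adaptive sharpening) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def temp_style_pseudo_labels(embeddings, responses, n_clusters=8,
                             temperature=0.5, n_samples=3, seed=0):
    """Hypothetical sketch of TEMP-style pseudo-label sampling.

    Assumes `embeddings` is an (N, d) array of response encodings and
    `responses` is the list of the N candidate response strings.
    """
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(embeddings)

    pseudo_labels = []
    for c in range(n_clusters):
        idx = np.where(cluster_ids == c)[0]
        if len(idx) == 0:
            continue
        # Score members by distance to the cluster centroid; far (tail)
        # members are more likely to be unsafe outliers.
        dists = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        # Lower temperature sharpens the distribution toward the cluster core.
        logits = -dists / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        chosen = rng.choice(idx, size=min(n_samples, len(idx)),
                            replace=False, p=probs)
        pseudo_labels.extend(responses[i] for i in chosen)
    return pseudo_labels
```

The sampled pseudo-labels would then serve as the weak supervision signal for retraining the response generator, replacing detected unsafe targets with cluster-consistent safe alternatives instead of fixed templates.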
Related papers
- Benchmarking Misuse Mitigation Against Covert Adversaries [80.74502950627736]
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. We develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
arXiv Detail & Related papers (2025-06-06T17:33:33Z)
- The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems [101.68501850486179]
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-$k$ candidate set. We propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG.
arXiv Detail & Related papers (2025-05-24T08:19:25Z)
- Stealthy LLM-Driven Data Poisoning Attacks Against Embedding-Based Retrieval-Augmented Recommender Systems [16.79952669254101]
We study provider-side data poisoning in retrieval-augmented recommender systems (RAG). By modifying only a small fraction of tokens within item descriptions, an attacker can significantly promote or demote targeted items. Experiments on MovieLens, using two large language model (LLM) retrieval modules, show that even subtle attacks shift final rankings and item exposures while eluding naive detection.
arXiv Detail & Related papers (2025-05-08T12:53:42Z)
- A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation [93.28532038721816]
Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. We propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples.
arXiv Detail & Related papers (2025-04-11T10:18:13Z)
- Corpus Poisoning via Approximate Greedy Gradient Descent [48.5847914481222]
We propose Approximate Greedy Gradient Descent, a new attack on dense retrieval systems based on the widely used HotFlip method for generating adversarial passages. We show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains.
arXiv Detail & Related papers (2024-06-07T17:02:35Z)
- Contactless Fingerprint Biometric Anti-Spoofing: An Unsupervised Deep Learning Approach [0.0]
We introduce an innovative anti-spoofing approach that combines an unsupervised autoencoder with a convolutional block attention module. The scheme has achieved an average BPCER of 0.96% with an APCER of 1.6% for presentation attacks involving various types of spoofed samples.
arXiv Detail & Related papers (2023-11-07T17:19:59Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks. We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts. We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Confidence-driven Sampling for Backdoor Attacks [49.72680157684523]
Backdoor attacks aim to surreptitiously insert malicious triggers into DNN models, granting unauthorized control during testing scenarios. Existing methods lack robustness against defense strategies and predominantly focus on enhancing trigger stealthiness while randomly selecting poisoned samples. We introduce a straightforward yet highly effective sampling methodology that leverages confidence scores; specifically, it selects samples with lower confidence scores, significantly increasing the challenge for defenders in identifying and countering these attacks.
arXiv Detail & Related papers (2023-10-08T18:57:36Z)
- Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks [72.03945355787776]
We advocate MDP, a lightweight, pluggable, and effective defense for PLMs as few-shot learners. We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness.
arXiv Detail & Related papers (2023-09-23T04:41:55Z)
- Identifying Adversarially Attackable and Robust Samples [1.4213973379473654]
Adversarial attacks insert small, imperceptible perturbations into input samples that cause large, undesired changes to the output of deep learning models. This work introduces the notion of sample attackability, where we aim to identify samples that are most susceptible to adversarial attacks. We propose a deep-learning-based detector to identify the adversarially attackable and robust samples in an unseen dataset for an unseen target model.
arXiv Detail & Related papers (2023-01-30T13:58:14Z)
- Generalizable Black-Box Adversarial Attack with Meta Learning [54.196613395045595]
In a black-box adversarial attack, the target model's parameters are unknown, and the attacker aims to find a successful perturbation based on query feedback under a query budget. We propose to utilize the feedback information across historical attacks, dubbed example-level adversarial transferability. The proposed framework with the two types of adversarial transferability can be naturally combined with any off-the-shelf query-based attack methods to boost their performance.
arXiv Detail & Related papers (2023-01-01T07:24:12Z)
- Adversarially Robust One-class Novelty Detection [83.1570537254877]
We show that existing novelty detectors are susceptible to adversarial examples. We propose a defense strategy that manipulates the latent space of novelty detectors to improve the robustness against adversarial examples.
arXiv Detail & Related papers (2021-08-25T10:41:29Z)
- An Adversarially-Learned Turing Test for Dialog Generation Models [45.991035017908594]
We propose an adversarial training approach to learn a robust model, ATT, that discriminates machine-generated responses from human-written replies. In contrast to previous perturbation-based methods, our discriminator is trained by iteratively generating unrestricted and diverse adversarial examples. Our discriminator shows high accuracy on strong attackers including DialoGPT and GPT-3.
arXiv Detail & Related papers (2021-04-16T17:13:14Z)
- Investigating Robustness of Adversarial Samples Detection for Automatic Speaker Verification [78.51092318750102]
This work proposes to defend ASV systems against adversarial attacks with a separate detection network. A VGG-like binary classification detector is introduced and demonstrated to be effective on detecting adversarial samples.
arXiv Detail & Related papers (2020-06-11T04:31:56Z)