Learn What NOT to Learn: Towards Generative Safety in Chatbots
- URL: http://arxiv.org/abs/2304.11220v2
- Date: Tue, 25 Apr 2023 08:16:47 GMT
- Title: Learn What NOT to Learn: Towards Generative Safety in Chatbots
- Authors: Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi,
Hossein Sameti, Pascale Fung
- Abstract summary: We present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals.
LOT reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models.
- Score: 40.8106410437709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational models that are generative and open-domain are particularly
susceptible to generating unsafe content since they are trained on web-based
social data. Prior approaches to mitigating this issue have drawbacks, such as
disrupting the flow of conversation, limited generalization to unseen toxic
input contexts, and sacrificing the quality of the dialogue for the sake of
safety. In this paper, we present a novel framework, named "LOT" (Learn NOT
to), that employs a contrastive loss to enhance generalization by learning from
both positive and negative training signals. Our approach differs from the
standard contrastive learning framework in that it automatically obtains
positive and negative signals from the safe and unsafe language distributions
that have been learned beforehand. The LOT framework utilizes divergence to
steer the generations away from the unsafe subspace and towards the safe
subspace while sustaining the flow of conversation. Our approach is memory and
time-efficient during decoding and effectively reduces toxicity while
preserving engagingness and fluency. Empirical results indicate that LOT
reduces toxicity by up to four-fold while achieving four to six-fold higher
rates of engagingness and fluency compared to baseline models. Our findings are
further corroborated by human evaluation.
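The mechanism described above can be made concrete with a hedged sketch. The snippet below is not the authors' implementation: the Hugging Face-style causal LMs `model`, `safe_lm`, and `unsafe_lm` (standing in for the safe and unsafe language distributions learned beforehand), the loss weights, and the margin are all illustrative assumptions. It combines a standard language-modeling loss, which sustains the flow of conversation, with divergence terms that pull the model toward the safe distribution and push it away from the unsafe one.
```python
# Hedged sketch of a LOT-style objective (not the authors' code). `model`,
# `safe_lm`, and `unsafe_lm` are assumed to be Hugging Face-style causal LMs;
# the frozen safe/unsafe LMs stand in for the pre-learned safe and unsafe
# language distributions mentioned in the abstract.
import torch
import torch.nn.functional as F

def lot_style_loss(model, safe_lm, unsafe_lm, input_ids, labels,
                   alpha=1.0, beta=1.0, margin=5.0):
    logits = model(input_ids).logits                      # (batch, seq, vocab)
    log_p = F.log_softmax(logits, dim=-1)

    with torch.no_grad():                                 # reference LMs stay frozen
        log_p_safe = F.log_softmax(safe_lm(input_ids).logits, dim=-1)
        log_p_unsafe = F.log_softmax(unsafe_lm(input_ids).logits, dim=-1)

    # Standard next-token loss keeps the response fluent and on-topic.
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

    # Positive signal: pull the model's token distribution toward the safe LM.
    kl_to_safe = F.kl_div(log_p, log_p_safe, log_target=True,
                          reduction="batchmean")
    # Negative signal: demand at least `margin` nats of divergence from the
    # unsafe LM; the hinge keeps the push-away term bounded.
    kl_to_unsafe = F.kl_div(log_p, log_p_unsafe, log_target=True,
                            reduction="batchmean")
    push_away = F.relu(margin - kl_to_unsafe)

    return nll + alpha * kl_to_safe + beta * push_away
```
The hinge margin is a design choice of this sketch: it keeps the negative term bounded so that pushing away from the unsafe distribution cannot dominate the fluency objective.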
Related papers
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges.
We propose SAFREE, a training-free approach for safe T2I and T2V.
We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
arXiv Detail & Related papers (2024-10-16T17:32:23Z)
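As a hedged illustration of the subspace-steering idea in the SAFREE entry above (not that paper's implementation), one way to remove the toxic-concept component from a prompt embedding is an orthogonal projection; the tensor shapes and the QR-based basis construction are assumptions for the sketch.
```python
# Hedged sketch of subspace steering in the spirit of the SAFREE summary above.
# `toxic_embeddings` is assumed to be a (k, dim) matrix whose rows embed a set
# of toxic concept phrases; `prompt_embedding` is the (dim,) vector to steer.
import torch

def project_away_from_subspace(prompt_embedding: torch.Tensor,
                               toxic_embeddings: torch.Tensor) -> torch.Tensor:
    # Orthonormal basis of the toxic-concept subspace via QR decomposition.
    basis, _ = torch.linalg.qr(toxic_embeddings.T)        # (dim, k)
    # Component of the prompt embedding that lies inside the toxic subspace.
    toxic_component = basis @ (basis.T @ prompt_embedding)
    # Remove that component, steering the embedding away from the subspace.
    return prompt_embedding - toxic_component
```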
- ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time [12.160713548659457]
Adversarial visual inputs can easily bypass VLM defense mechanisms.
We propose a novel two-phase inference-time alignment framework, evaluating input visual contents and output responses.
Experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency.
arXiv Detail & Related papers (2024-10-09T07:21:43Z)
- Improving Dialog Safety using Socially Aware Contrastive Learning [8.503001932363704]
We study prosociality in both adversarial and casual dialog contexts.
We propose a dual-step fine-tuning process to address these issues.
We train a base model that integrates prosocial behavior by leveraging datasets like Moral Integrity Corpus (MIC) and ProsocialDialog.
arXiv Detail & Related papers (2024-02-01T09:24:33Z)
- Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning [71.8876256714229]
We propose an entity-based contrastive learning framework for improving the robustness of knowledge-grounded dialogue systems.
Our method achieves new state-of-the-art performance in terms of automatic evaluation scores.
arXiv Detail & Related papers (2024-01-09T05:16:52Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Using In-Context Learning to Improve Dialogue Safety [45.303005593685036]
We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots.
It uses in-context learning to steer a model towards safer generations.
We find our method performs competitively with strong baselines without requiring training.
arXiv Detail & Related papers (2023-02-02T04:46:03Z)
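A minimal sketch of the retrieval-plus-in-context-learning idea in the entry above (not that paper's implementation): the retriever interface, exemplar pool, and prompt format are assumptions for illustration.
```python
# Hedged sketch: retrieve safe exemplar exchanges and prepend them as in-context
# demonstrations so the model is steered toward safer generations.
from typing import Callable, List, Tuple

def build_safe_prompt(user_message: str,
                      retrieve: Callable[[str, int], List[Tuple[str, str]]],
                      k: int = 3) -> str:
    """Prepend k retrieved (context, safe response) demonstrations to the prompt."""
    demonstrations = retrieve(user_message, k)  # most similar safe exemplars
    lines = []
    for context, safe_response in demonstrations:
        lines.append(f"User: {context}\nBot: {safe_response}")
    lines.append(f"User: {user_message}\nBot:")
    return "\n\n".join(lines)
```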
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
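For context, a hedged sketch of the kind of measurement RealToxicityPrompts performs: generate continuations for a set of prompts and score them with a toxicity classifier. The benchmark itself scores continuations with the Perspective API; the open-source classifier, model choice, and generation settings below are illustrative substitutes, not the benchmark's setup.
```python
# Hedged sketch of a RealToxicityPrompts-style measurement with assumed models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def toxic_degeneration_rate(prompts, samples_per_prompt=5, threshold=0.5):
    """Fraction of sampled continuations the classifier flags as toxic."""
    toxic, total = 0, 0
    for prompt in prompts:
        outputs = generator(prompt, max_new_tokens=30, do_sample=True,
                            num_return_sequences=samples_per_prompt)
        for out in outputs:
            continuation = out["generated_text"][len(prompt):]
            pred = toxicity(continuation)[0]       # top label and score
            toxic += int(pred["label"] == "toxic" and pred["score"] >= threshold)
            total += 1
    return toxic / total
```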