Learn What NOT to Learn: Towards Generative Safety in Chatbots
- URL: http://arxiv.org/abs/2304.11220v2
- Date: Tue, 25 Apr 2023 08:16:47 GMT
- Title: Learn What NOT to Learn: Towards Generative Safety in Chatbots
- Authors: Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi,
Hossein Sameti, Pascale Fung
- Abstract summary: We present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals.
LOT reduces toxicity by up to four-fold while achieving four- to six-fold higher rates of engagingness and fluency compared to baseline models.
- Score: 40.8106410437709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational models that are generative and open-domain are particularly
susceptible to generating unsafe content since they are trained on web-based
social data. Prior approaches to mitigating this issue have drawbacks, such as
disrupting the flow of conversation, limited generalization to unseen toxic
input contexts, and sacrificing the quality of the dialogue for the sake of
safety. In this paper, we present a novel framework, named "LOT" (Learn NOT
to), that employs a contrastive loss to enhance generalization by learning from
both positive and negative training signals. Our approach differs from the
standard contrastive learning framework in that it automatically obtains
positive and negative signals from the safe and unsafe language distributions
that have been learned beforehand. The LOT framework utilizes divergence to
steer the generations away from the unsafe subspace and towards the safe
subspace while sustaining the flow of conversation. Our approach is memory and
time-efficient during decoding and effectively reduces toxicity while
preserving engagingness and fluency. Empirical results indicate that LOT
reduces toxicity by up to four-fold while achieving four- to six-fold higher
rates of engagingness and fluency compared to baseline models. Our findings are
further corroborated by human evaluation.
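The abstract does not spell out the exact form of LOT's objective, but the idea of combining a standard likelihood term with a divergence term that pulls generations toward a previously learned safe distribution and pushes them away from an unsafe one can be sketched as follows. This is an illustrative toy example, not the paper's implementation; the function names, the KL-based divergence term, and the weighting factor `lam` are all assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for distributions given as dicts over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def lot_style_loss(p_model, p_safe, p_unsafe, target_token, lam=0.5):
    """Hypothetical contrastive objective: negative log-likelihood of the
    target token, plus a divergence term that rewards closeness to the safe
    language distribution and distance from the unsafe one."""
    nll = -math.log(p_model[target_token])
    divergence_term = kl_divergence(p_model, p_safe) - kl_divergence(p_model, p_unsafe)
    return nll + lam * divergence_term

# Toy next-token distributions over a three-token vocabulary.
p_model  = {"kind": 0.5, "neutral": 0.3, "rude": 0.2}
p_safe   = {"kind": 0.7, "neutral": 0.25, "rude": 0.05}
p_unsafe = {"kind": 0.1, "neutral": 0.2, "rude": 0.7}

loss = lot_style_loss(p_model, p_safe, p_unsafe, target_token="kind")
```

Under this toy objective, a model distribution that drifts toward the unsafe distribution incurs a strictly higher loss than one that stays near the safe distribution, which mirrors the steering behavior the abstract describes.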
Related papers
- Improving Dialog Safety using Socially Aware Contrastive Learning [8.503001932363704]
We study prosociality in both adversarial and casual dialog contexts.
We propose a dual-step fine-tuning process to address these issues.
We train a base model that integrates prosocial behavior by leveraging datasets like Moral Integrity Corpus (MIC) and ProsocialDialog.
arXiv Detail & Related papers (2024-02-01T09:24:33Z)
- Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning [71.8876256714229]
We propose an entity-based contrastive learning framework for improving the robustness of knowledge-grounded dialogue systems.
Our method achieves new state-of-the-art performance in terms of automatic evaluation scores.
arXiv Detail & Related papers (2024-01-09T05:16:52Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
arXiv Detail & Related papers (2023-08-08T14:03:08Z)
- Using In-Context Learning to Improve Dialogue Safety [45.303005593685036]
We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots.
It uses in-context learning to steer a model towards safer generations.
We find our method performs competitively with strong baselines without requiring training.
arXiv Detail & Related papers (2023-02-02T04:46:03Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called "reverse generation" to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.