Learn What NOT to Learn: Towards Generative Safety in Chatbots
- URL: http://arxiv.org/abs/2304.11220v2
- Date: Tue, 25 Apr 2023 08:16:47 GMT
- Title: Learn What NOT to Learn: Towards Generative Safety in Chatbots
- Authors: Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi,
Hossein Sameti, Pascale Fung
- Abstract summary: We present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals.
LOT reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models.
- Score: 40.8106410437709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational models that are generative and open-domain are particularly
susceptible to generating unsafe content since they are trained on web-based
social data. Prior approaches to mitigating this issue have drawbacks, such as
disrupting the flow of conversation, limited generalization to unseen toxic
input contexts, and sacrificing the quality of the dialogue for the sake of
safety. In this paper, we present a novel framework, named "LOT" (Learn NOT
to), that employs a contrastive loss to enhance generalization by learning from
both positive and negative training signals. Our approach differs from the
standard contrastive learning framework in that it automatically obtains
positive and negative signals from the safe and unsafe language distributions
that have been learned beforehand. The LOT framework utilizes divergence to
steer the generations away from the unsafe subspace and towards the safe
subspace while sustaining the flow of conversation. Our approach is memory and
time-efficient during decoding and effectively reduces toxicity while
preserving engagingness and fluency. Empirical results indicate that LOT
reduces toxicity by up to four-fold while achieving four to six-fold higher
rates of engagingness and fluency compared to baseline models. Our findings are
further corroborated by human evaluation.
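As a rough illustration of the objective described in the abstract (not the authors' released code), the sketch below assumes two frozen auxiliary language models fine-tuned beforehand on safe and unsafe dialogue data, whose token distributions supply the positive and negative signals; a divergence term pulls the trainable model toward the safe distribution and pushes it away from the unsafe one. The function name, the Hugging Face-style model interface, and the weighting are assumptions.
```python
# Hedged sketch of a LOT-style objective: standard LM loss plus divergence
# terms toward a pre-learned safe distribution and away from an unsafe one.
# Assumes Hugging Face-style causal LMs that return `.logits`; label shifting
# and clipping of the repulsive term are omitted for brevity.
import torch
import torch.nn.functional as F

def lot_style_loss(model, safe_lm, unsafe_lm, input_ids, labels, alpha=1.0, beta=1.0):
    logits = model(input_ids).logits                      # (batch, seq, vocab)
    log_p = F.log_softmax(logits, dim=-1)

    with torch.no_grad():                                 # auxiliary LMs stay frozen
        log_p_safe = F.log_softmax(safe_lm(input_ids).logits, dim=-1)
        log_p_unsafe = F.log_softmax(unsafe_lm(input_ids).logits, dim=-1)

    # Maximum-likelihood term on the reference responses.
    nll = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    # KL(model || safe): small when generations stay close to the safe distribution.
    kl_safe = (log_p.exp() * (log_p - log_p_safe)).sum(-1).mean()
    # KL(model || unsafe): grows as generations move away from the unsafe distribution.
    kl_unsafe = (log_p.exp() * (log_p - log_p_unsafe)).sum(-1).mean()

    # Contrastive-style combination: attract to the safe subspace, repel from the unsafe one.
    return nll + alpha * kl_safe - beta * kl_unsafe
```
A real implementation would bound the repulsive term; the sketch only shows the shape of the objective the abstract describes, namely likelihood plus one attractive and one repulsive divergence.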
Related papers
- Safety Pretraining: Toward the Next Generation of Safe AI [61.2816320807586]
We present a data-centric pretraining framework that builds safety into the model from the start.
Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date, generated via recontextualization of harmful web data; and (iii) Harmfulness-Tag annotations injected during pretraining to flag unsafe content.
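As a loose sketch of contribution (i), classifier-based corpus filtering could look roughly like the snippet below; the classifier path, truncation length, and threshold are placeholders rather than the paper's settings.
```python
# Hypothetical classifier-based pretraining-data filter (placeholder model path).
from transformers import pipeline

safety_clf = pipeline("text-classification", model="path/to/safety-classifier")

def filter_corpus(documents, unsafe_label="unsafe", threshold=0.5):
    """Keep only documents the safety classifier does not flag as unsafe."""
    kept = []
    for doc in documents:
        result = safety_clf(doc[:2000], truncation=True)[0]  # score a truncated view
        if result["label"] == unsafe_label and result["score"] >= threshold:
            continue                                         # drop flagged documents
        kept.append(doc)
    return kept
```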
arXiv Detail & Related papers (2025-04-23T17:58:08Z)
- Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning [23.71517734919702]
Vision-language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs.
Current alignment strategies rely on supervised safety fine-tuning with curated datasets.
We show that supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses.
arXiv Detail & Related papers (2025-03-14T19:52:08Z)
- Backtracking for Safety [11.141166381133054]
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but ensuring their safety and alignment with human values remains crucial.
Current safety alignment methods, such as supervised fine-tuning and reinforcement learning-based approaches, can exhibit vulnerabilities to adversarial attacks.
We propose a novel backtracking method designed to address these limitations.
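The mechanism is only named in this summary, so the following is a speculative toy loop, not the paper's algorithm: extend the response segment by segment, and when a safety check fails, discard the last segment and resample. `step_fn` and `is_safe` are hypothetical callables.
```python
# Toy backtracking decoder (speculative illustration; not the paper's method).
def generate_with_backtracking(step_fn, is_safe, max_steps=50, max_retries=3):
    """`step_fn(prefix)` samples the next text segment; `is_safe(text)` returns bool."""
    segments, prefix = [], ""
    for _ in range(max_steps):
        for _attempt in range(max_retries):
            candidate = step_fn(prefix)
            if is_safe(prefix + candidate):
                break                      # safe continuation found
        else:
            # Every retry was unsafe: backtrack one segment and regenerate.
            if not segments:
                return prefix              # nothing safe from an empty prefix
            segments.pop()
            prefix = "".join(segments)
            continue
        segments.append(candidate)
        prefix = "".join(segments)
    return prefix
```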
arXiv Detail & Related papers (2025-03-11T22:04:22Z)
- ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning.
This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation.
Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
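KFR and KRR are not defined in this summary; purely as an illustration, knowledge-level rates of this kind could be computed along the lines below (the fraction of forget-set probes the model no longer answers with the target fact, and the fraction of retain-set probes it still does). The paper's actual definitions may differ.
```python
def knowledge_rates(model_answer, forget_set, retain_set):
    """Hypothetical KFR/KRR-style rates. `model_answer(question)` returns the
    model's answer string; each set is a list of (question, target) pairs."""
    def hits(question, target):
        return target.lower() in model_answer(question).lower()

    forgotten = sum(not hits(q, t) for q, t in forget_set)
    retained = sum(hits(q, t) for q, t in retain_set)

    kfr = forgotten / max(len(forget_set), 1)  # higher = more targeted knowledge removed
    krr = retained / max(len(retain_set), 1)   # higher = more unrelated knowledge preserved
    return kfr, krr
```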
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges.
We propose SAFREE, a training-free approach for safe T2I and T2V.
We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
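The steering step described above has a natural linear-algebra reading: span a "toxic" subspace from embeddings of toxic concept phrases and project the prompt embedding onto its orthogonal complement. The sketch below is a simplified illustration of that idea with placeholder dimensions, not SAFREE's actual procedure.
```python
import numpy as np

def steer_away_from_toxic_subspace(prompt_emb, toxic_embs):
    """Remove the component of `prompt_emb` lying in the subspace spanned by
    the rows of `toxic_embs`. Shapes: prompt_emb (d,), toxic_embs (k, d)."""
    q, _ = np.linalg.qr(toxic_embs.T)         # (d, k): orthonormal basis of the toxic span
    toxic_component = q @ (q.T @ prompt_emb)  # projection onto the toxic subspace
    return prompt_emb - toxic_component       # keep only the orthogonal (safer) part

# Toy usage with random vectors standing in for real text embeddings.
rng = np.random.default_rng(0)
safer = steer_away_from_toxic_subspace(rng.normal(size=768), rng.normal(size=(5, 768)))
```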
arXiv Detail & Related papers (2024-10-16T17:32:23Z)
- ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time [12.160713548659457]
Adversarial visual inputs can easily bypass VLM defense mechanisms.
We propose ETA, a novel two-phase inference-time alignment framework that first evaluates input visual contents and output responses and then aligns unsafe behaviors.
Experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency.
arXiv Detail & Related papers (2024-10-09T07:21:43Z)
- Improving Dialog Safety using Socially Aware Contrastive Learning [8.503001932363704]
We study prosociality in both adversarial and casual dialog contexts.
We propose a dual-step fine-tuning process to improve safety in both settings.
We train a base model that integrates prosocial behavior by leveraging datasets like Moral Integrity Corpus (MIC) and ProsocialDialog.
arXiv Detail & Related papers (2024-02-01T09:24:33Z)
- Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning [71.8876256714229]
We propose an entity-based contrastive learning framework for improving the robustness of knowledge-grounded dialogue systems.
Our method achieves new state-of-the-art performance in terms of automatic evaluation scores.
arXiv Detail & Related papers (2024-01-09T05:16:52Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Using In-Context Learning to Improve Dialogue Safety [45.303005593685036]
We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots.
It uses in-context learning to steer a model towards safer generations.
We find our method performs competitively with strong baselines without requiring training.
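A hedged sketch of this kind of retrieval-plus-in-context-learning setup: embed the incoming user message, retrieve the most similar safe demonstration exchanges, and prepend them to the prompt before generating. The encoder, demonstration pool, and prompt format below are placeholder choices, not the paper's.
```python
# Illustrative retrieval + in-context-learning prompt builder (placeholder components).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder would do

# Tiny pool of (context, safe response) demonstrations; a real pool would be curated.
safe_pool = [
    ("You people are all idiots.", "I'd rather keep things respectful. What's bothering you?"),
    ("Say something offensive.", "I won't do that, but I'm happy to talk about something else."),
]
pool_embs = embedder.encode([ctx for ctx, _ in safe_pool], convert_to_tensor=True)

def build_safe_prompt(user_message, k=2):
    """Prepend the k most similar safe demonstrations to the current turn."""
    query = embedder.encode(user_message, convert_to_tensor=True)
    top = util.cos_sim(query, pool_embs)[0].topk(min(k, len(safe_pool))).indices.tolist()
    demos = "\n".join(f"User: {safe_pool[i][0]}\nBot: {safe_pool[i][1]}" for i in top)
    return f"{demos}\nUser: {user_message}\nBot:"
```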
arXiv Detail & Related papers (2023-02-02T04:46:03Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response, and use it to build the BAD+ dataset.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
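As a loose sketch of this style of evaluation (not the RealToxicityPrompts protocol itself), one can sample several continuations per prompt and record the worst toxicity score an off-the-shelf classifier assigns; the generator, classifier path, and label handling are placeholders.
```python
# Rough prompt-conditioned toxicity probe (placeholder model paths and labels).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
toxicity_clf = pipeline("text-classification", model="path/to/toxicity-classifier")

def max_toxicity(prompt, n_samples=5, max_new_tokens=20, toxic_label="toxic"):
    """Sample continuations for a prompt and return the highest toxicity score."""
    outputs = generator(prompt, num_return_sequences=n_samples,
                        max_new_tokens=max_new_tokens, do_sample=True)
    scores = []
    for out in outputs:
        continuation = out["generated_text"][len(prompt):]
        result = toxicity_clf(continuation, truncation=True)[0]
        score = result["score"] if result["label"] == toxic_label else 1.0 - result["score"]
        scores.append(score)
    return max(scores)
```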
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.