Learn What NOT to Learn: Towards Generative Safety in Chatbots
- URL: http://arxiv.org/abs/2304.11220v2
- Date: Tue, 25 Apr 2023 08:16:47 GMT
- Title: Learn What NOT to Learn: Towards Generative Safety in Chatbots
- Authors: Leila Khalatbari, Yejin Bang, Dan Su, Willy Chung, Saeed Ghadimi,
Hossein Sameti, Pascale Fung
- Abstract summary: We present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals.
LOT reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models.
- Score: 40.8106410437709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conversational models that are generative and open-domain are particularly
susceptible to generating unsafe content since they are trained on web-based
social data. Prior approaches to mitigating this issue have drawbacks, such as
disrupting the flow of conversation, limited generalization to unseen toxic
input contexts, and sacrificing the quality of the dialogue for the sake of
safety. In this paper, we present a novel framework, named "LOT" (Learn NOT
to), that employs a contrastive loss to enhance generalization by learning from
both positive and negative training signals. Our approach differs from the
standard contrastive learning framework in that it automatically obtains
positive and negative signals from the safe and unsafe language distributions
that have been learned beforehand. The LOT framework utilizes divergence to
steer the generations away from the unsafe subspace and towards the safe
subspace while sustaining the flow of conversation. Our approach is memory and
time-efficient during decoding and effectively reduces toxicity while
preserving engagingness and fluency. Empirical results indicate that LOT
reduces toxicity by up to four-fold while achieving four to six-fold higher
rates of engagingness and fluency compared to baseline models. Our findings are
further corroborated by human evaluation.
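As a rough illustration of the objective described in the abstract (not the authors' released code), the sketch below assumes two frozen auxiliary language models fine-tuned beforehand on safe and unsafe dialogue data, whose token distributions supply the positive and negative signals; a divergence term pulls the trainable model toward the safe distribution and pushes it away from the unsafe one. The function name, the Hugging Face-style model interface, and the weighting are assumptions.
```python
# Hedged sketch of a LOT-style objective: standard LM loss plus divergence
# terms toward a pre-learned safe distribution and away from an unsafe one.
# Assumes Hugging Face-style causal LMs that return `.logits`; label shifting
# and clipping of the repulsive term are omitted for brevity.
import torch
import torch.nn.functional as F

def lot_style_loss(model, safe_lm, unsafe_lm, input_ids, labels, alpha=1.0, beta=1.0):
    logits = model(input_ids).logits                      # (batch, seq, vocab)
    log_p = F.log_softmax(logits, dim=-1)

    with torch.no_grad():                                 # auxiliary LMs stay frozen
        log_p_safe = F.log_softmax(safe_lm(input_ids).logits, dim=-1)
        log_p_unsafe = F.log_softmax(unsafe_lm(input_ids).logits, dim=-1)

    # Maximum-likelihood term on the reference responses.
    nll = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    # KL(model || safe): small when generations stay close to the safe distribution.
    kl_safe = (log_p.exp() * (log_p - log_p_safe)).sum(-1).mean()
    # KL(model || unsafe): grows as generations move away from the unsafe distribution.
    kl_unsafe = (log_p.exp() * (log_p - log_p_unsafe)).sum(-1).mean()

    # Contrastive-style combination: attract to the safe subspace, repel from the unsafe one.
    return nll + alpha * kl_safe - beta * kl_unsafe
```
A real implementation would bound the repulsive term; the sketch only shows the shape of the objective the abstract describes, namely likelihood plus one attractive and one repulsive divergence.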
Related papers
- Safety Pretraining: Toward the Next Generation of Safe AI [61.2816320807586]
We present a data-centric pretraining framework that builds safety into the model from the start.
Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date, generated via recontextualization of harmful web data; and (iii) Harmfulness-Tag annotations injected during pretraining to flag unsafe content.
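As a loose sketch of contribution (i), classifier-based corpus filtering could look roughly like the snippet below; the classifier path, truncation length, and threshold are placeholders rather than the paper's settings.
```python
# Hypothetical classifier-based pretraining-data filter (placeholder model path).
from transformers import pipeline

safety_clf = pipeline("text-classification", model="path/to/safety-classifier")

def filter_corpus(documents, unsafe_label="unsafe", threshold=0.5):
    """Keep only documents the safety classifier does not flag as unsafe."""
    kept = []
    for doc in documents:
        result = safety_clf(doc[:2000], truncation=True)[0]  # score a truncated view
        if result["label"] == unsafe_label and result["score"] >= threshold:
            continue                                         # drop flagged documents
        kept.append(doc)
    return kept
```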
arXiv Detail & Related papers (2025-04-23T17:58:08Z)
- Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning [23.71517734919702]
Vision-language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs.
Current alignment strategies rely on supervised safety fine-tuning with curated datasets.
We show that supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses.
arXiv Detail & Related papers (2025-03-14T19:52:08Z)
- Backtracking for Safety [11.141166381133054]
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks, but ensuring their safety and alignment with human values remains crucial.
Current safety alignment methods, such as supervised fine-tuning and reinforcement learning-based approaches, can exhibit vulnerabilities to adversarial attacks.
We propose a novel backtracking method designed to address these limitations.
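The mechanism is only named in this summary, so the following is a speculative toy loop, not the paper's algorithm: extend the response segment by segment, and when a safety check fails, discard the last segment and resample. `step_fn` and `is_safe` are hypothetical callables.
```python
# Toy backtracking decoder (speculative illustration; not the paper's method).
def generate_with_backtracking(step_fn, is_safe, max_steps=50, max_retries=3):
    """`step_fn(prefix)` samples the next text segment; `is_safe(text)` returns bool."""
    segments, prefix = [], ""
    for _ in range(max_steps):
        for _attempt in range(max_retries):
            candidate = step_fn(prefix)
            if is_safe(prefix + candidate):
                break                      # safe continuation found
        else:
            # Every retry was unsafe: backtrack one segment and regenerate.
            if not segments:
                return prefix              # nothing safe from an empty prefix
            segments.pop()
            prefix = "".join(segments)
            continue
        segments.append(candidate)
        prefix = "".join(segments)
    return prefix
```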
arXiv Detail & Related papers (2025-03-11T22:04:22Z)
- ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning.
This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation.
Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
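KFR and KRR are not defined in this summary; purely as an illustration, knowledge-level rates of this kind could be computed along the lines below (the fraction of forget-set probes the model no longer answers with the target fact, and the fraction of retain-set probes it still does). The paper's actual definitions may differ.
```python
def knowledge_rates(model_answer, forget_set, retain_set):
    """Hypothetical KFR/KRR-style rates. `model_answer(question)` returns the
    model's answer string; each set is a list of (question, target) pairs."""
    def hits(question, target):
        return target.lower() in model_answer(question).lower()

    forgotten = sum(not hits(q, t) for q, t in forget_set)
    retained = sum(hits(q, t) for q, t in retain_set)

    kfr = forgotten / max(len(forget_set), 1)  # higher = more targeted knowledge removed
    krr = retained / max(len(retain_set), 1)   # higher = more unrelated knowledge preserved
    return kfr, krr
```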
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges.
We propose SAFREE, a training-free approach for safe T2I and T2V.
We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
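The steering step described above has a natural linear-algebra reading: span a "toxic" subspace from embeddings of toxic concept phrases and project the prompt embedding onto its orthogonal complement. The sketch below is a simplified illustration of that idea with placeholder dimensions, not SAFREE's actual procedure.
```python
import numpy as np

def steer_away_from_toxic_subspace(prompt_emb, toxic_embs):
    """Remove the component of `prompt_emb` lying in the subspace spanned by
    the rows of `toxic_embs`. Shapes: prompt_emb (d,), toxic_embs (k, d)."""
    q, _ = np.linalg.qr(toxic_embs.T)         # (d, k): orthonormal basis of the toxic span
    toxic_component = q @ (q.T @ prompt_emb)  # projection onto the toxic subspace
    return prompt_emb - toxic_component       # keep only the orthogonal (safer) part

# Toy usage with random vectors standing in for real text embeddings.
rng = np.random.default_rng(0)
safer = steer_away_from_toxic_subspace(rng.normal(size=768), rng.normal(size=(5, 768)))
```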
arXiv Detail & Related papers (2024-10-16T17:32:23Z)
- ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time [12.160713548659457]
Adversarial visual inputs can easily bypass VLM defense mechanisms.
We propose ETA, a novel two-phase inference-time alignment framework that first evaluates input visual contents and output responses and then aligns unsafe behaviors.
Experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency.
arXiv Detail & Related papers (2024-10-09T07:21:43Z)
- Improving Dialog Safety using Socially Aware Contrastive Learning [8.503001932363704]
We study prosociality in both adversarial and casual dialog contexts.
We propose a dual-step fine-tuning process to improve safety in both settings.
We train a base model that integrates prosocial behavior by leveraging datasets like Moral Integrity Corpus (MIC) and ProsocialDialog.
arXiv Detail & Related papers (2024-02-01T09:24:33Z)
- Improving the Robustness of Knowledge-Grounded Dialogue via Contrastive Learning [71.8876256714229]
We propose an entity-based contrastive learning framework for improving the robustness of knowledge-grounded dialogue systems.
Our method achieves new state-of-the-art performance in terms of automatic evaluation scores.
arXiv Detail & Related papers (2024-01-09T05:16:52Z)
- Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [7.5520641322945785]
Large Language Models' safety remains a critical concern due to their vulnerability to adversarial attacks.
We introduce the Adversarial Prompt Shield (APS), a lightweight model that excels in detection accuracy and demonstrates resilience against adversarial prompts.
We also propose novel strategies for autonomously generating adversarial training datasets.
arXiv Detail & Related papers (2023-10-31T22:22:10Z)
- Using In-Context Learning to Improve Dialogue Safety [45.303005593685036]
We investigate a retrieval-based method for reducing bias and toxicity in responses from chatbots.
It uses in-context learning to steer a model towards safer generations.
We find our method performs competitively with strong baselines without requiring training.
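A hedged sketch of this kind of retrieval-plus-in-context-learning setup: embed the incoming user message, retrieve the most similar safe demonstration exchanges, and prepend them to the prompt before generating. The encoder, demonstration pool, and prompt format below are placeholder choices, not the paper's.
```python
# Illustrative retrieval + in-context-learning prompt builder (placeholder components).
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence encoder would do

# Tiny pool of (context, safe response) demonstrations; a real pool would be curated.
safe_pool = [
    ("You people are all idiots.", "I'd rather keep things respectful. What's bothering you?"),
    ("Say something offensive.", "I won't do that, but I'm happy to talk about something else."),
]
pool_embs = embedder.encode([ctx for ctx, _ in safe_pool], convert_to_tensor=True)

def build_safe_prompt(user_message, k=2):
    """Prepend the k most similar safe demonstrations to the current turn."""
    query = embedder.encode(user_message, convert_to_tensor=True)
    top = util.cos_sim(query, pool_embs)[0].topk(min(k, len(safe_pool))).indices.tolist()
    demos = "\n".join(f"User: {safe_pool[i][0]}\nBot: {safe_pool[i][1]}" for i in top)
    return f"{demos}\nUser: {user_message}\nBot:"
```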
arXiv Detail & Related papers (2023-02-02T04:46:03Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response, and use it to build the BAD+ dataset.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
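As a loose sketch of this style of evaluation (not the RealToxicityPrompts protocol itself), one can sample several continuations per prompt and record the worst toxicity score an off-the-shelf classifier assigns; the generator, classifier path, and label handling are placeholders.
```python
# Rough prompt-conditioned toxicity probe (placeholder model paths and labels).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
toxicity_clf = pipeline("text-classification", model="path/to/toxicity-classifier")

def max_toxicity(prompt, n_samples=5, max_new_tokens=20, toxic_label="toxic"):
    """Sample continuations for a prompt and return the highest toxicity score."""
    outputs = generator(prompt, num_return_sequences=n_samples,
                        max_new_tokens=max_new_tokens, do_sample=True)
    scores = []
    for out in outputs:
        continuation = out["generated_text"][len(prompt):]
        result = toxicity_clf(continuation, truncation=True)[0]
        score = result["score"] if result["label"] == toxic_label else 1.0 - result["score"]
        scores.append(score)
    return max(scores)
```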
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.