Constructing Highly Inductive Contexts for Dialogue Safety through
Controllable Reverse Generation
- URL: http://arxiv.org/abs/2212.01810v1
- Date: Sun, 4 Dec 2022 12:23:41 GMT
- Title: Constructing Highly Inductive Contexts for Dialogue Safety through
Controllable Reverse Generation
- Authors: Zhexin Zhang, Jiale Cheng, Hao Sun, Jiawen Deng, Fei Mi, Yasheng Wang,
Lifeng Shang, Minlie Huang
- Abstract summary: We propose a method called reverse generation to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
- Score: 65.48908724440047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pretrained language models can easily produce toxic or biased content,
which is prohibitive for practical use. In order to detect such toxic
generations, existing methods rely on templates, real-world data extraction,
crowdsourcing workers, or automatic generation to construct adversarial
contexts that are likely to induce toxic generations. However, what type of
context is more likely to induce unsafe responses is still under-explored. In
this paper, we identify that context toxicity and context category (e.g.,
\textit{profanity}, \textit{insult}, \textit{drugs}, etc.) are two important
factors to cause safety issues in response generation. Hence, we propose a
method called \emph{reverse generation} to construct adversarial contexts
conditioned on a given response, with the flexibility to control category,
toxicity level, and inductivity of the generated contexts. Via reverse
generation, we augment the existing BAD dataset and construct a new dataset
BAD+ which contains more than 120K diverse and highly inductive contexts in 12
categories. We test three popular pretrained dialogue models (Blender,
DialoGPT, and Plato2) and find that BAD+ can largely expose their safety
problems. Furthermore, we show that BAD+ can greatly enhance the safety of
generation and reveal the key factors of safety improvement. Our code and
dataset are available at \url{https://github.com/thu-coai/Reverse_Generation}.
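
The core operation of reverse generation is to condition a generator on a target response plus control codes for category, toxicity level, and inductivity, and decode the context that would elicit that response. The following is a minimal sketch of that interface using an off-the-shelf causal LM; the control-token format and the "gpt2" checkpoint are illustrative assumptions, not the model or training setup used in the paper.

# Minimal sketch of reverse generation: produce an adversarial *context*
# conditioned on a target *response* plus control codes. The control-token
# format and the base checkpoint are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def reverse_generate(response, category="insult", toxicity="toxic", inductive=True):
    # Control codes steer category, toxicity level, and inductivity of the context.
    controls = f"[category={category}] [toxicity={toxicity}] [inductive={int(inductive)}]"
    prompt = f"{controls} Response: {response} Context:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt and return only the generated context.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(reverse_generate("I completely agree with you."))
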
Related papers
- Toxic Subword Pruning for Dialogue Response Generation on Large Language Models [51.713448010799986]
We propose Toxic Subword Pruning (ToxPrune) to prune the subwords that make up toxic words from the BPE vocabulary of trained LLMs.
At the same time, ToxPrune clearly improves the toxic language model NSFW-3B on the task of dialogue response generation.
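
ToxPrune itself removes toxic subwords from the trained model's BPE vocabulary. A much cruder stand-in for that effect, shown below, is to ban the token-id sequences of known toxic words at decoding time via the bad_words_ids argument of Hugging Face generate; the toxic word list and checkpoint are placeholders.

# Rough decoding-time approximation of subword pruning: collect the BPE token
# ids of known toxic words and forbid those sequences during generation.
# The lexicon and checkpoint are placeholders; ToxPrune actually prunes the
# vocabulary of the trained model rather than masking at inference time.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

toxic_words = ["idiot", "stupid"]  # placeholder lexicon
bad_words_ids = [
    tokenizer(w, add_special_tokens=False).input_ids for w in toxic_words
] + [
    tokenizer(" " + w, add_special_tokens=False).input_ids for w in toxic_words
]

inputs = tokenizer("You are such a", return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    bad_words_ids=bad_words_ids,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
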
arXiv Detail & Related papers (2024-10-05T13:30:33Z) - Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are either based on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
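
As described, Latent Guard maps prompt embeddings from the T2I text encoder into a learned latent space and checks them against embeddings of harmful concepts. The sketch below shows such an inference-time check in generic PyTorch; the projection head, concept bank, and similarity threshold are illustrative assumptions, not the paper's trained components.

# Illustrative check in the spirit of Latent Guard: project text-encoder
# embeddings into a learned space and flag prompts that sit too close to any
# harmful-concept embedding. All dimensions and the threshold are assumptions.
import torch
import torch.nn.functional as F

class ConceptChecker(torch.nn.Module):
    def __init__(self, enc_dim=768, latent_dim=128):
        super().__init__()
        # Learned projection on top of the frozen text encoder.
        self.proj = torch.nn.Linear(enc_dim, latent_dim)

    def forward(self, prompt_emb, concept_embs, threshold=0.8):
        # Map both the prompt and the harmful concepts into the shared latent space.
        z_prompt = F.normalize(self.proj(prompt_emb), dim=-1)
        z_concepts = F.normalize(self.proj(concept_embs), dim=-1)
        sims = z_prompt @ z_concepts.T                 # cosine similarities
        return sims.max(dim=-1).values > threshold     # True -> block the prompt

checker = ConceptChecker()
prompt_emb = torch.randn(1, 768)     # stand-in for a T2I text-encoder embedding
concept_embs = torch.randn(20, 768)  # stand-in for harmful-concept embeddings
print(checker(prompt_emb, concept_embs))
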
arXiv Detail & Related papers (2024-04-11T17:59:52Z) - Fine-Grained Detoxification via Instance-Level Prefixes for Large
Language Models [26.474136481185724]
We propose fine-grained detoxification via instance-level prefixes (FGDILP) to mitigate toxic text without additional cost.
FGDILP contrasts the contextualized representation in attention space using a positive prefix-prepended prompt.
We validate that FGDILP enables controlled text generation with regard to toxicity at both the utterance and context levels.
arXiv Detail & Related papers (2024-02-23T09:04:48Z) - Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding [75.06872859716049]
Large Language Models (LLMs) have demonstrated a powerful ability for text generation.
However, undesired behaviors such as toxicity or hallucinations can manifest.
We propose formalizing text generation as a future-constrained generation problem.
arXiv Detail & Related papers (2023-12-11T06:35:33Z) - FLIRT: Feedback Loop In-context Red Teaming [71.38594755628581]
We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
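
The loop described here is: an attacker model generates adversarial prompts from in-context exemplars, the target model responds, the response is scored for safety, and successful attacks are fed back as new exemplars. The sketch below assumes hypothetical attacker_generate, target_respond, and is_unsafe helpers; it illustrates the feedback loop, not FLIRT's specific exemplar-selection strategies.

# Sketch of an in-context red-teaming feedback loop in the spirit of FLIRT.
# attacker_generate, target_respond, and is_unsafe are hypothetical helpers
# (an attacker LLM, the model under test, and a safety classifier); the
# exemplar-update rule shown here is a simple placeholder.
def red_team_loop(seed_prompts, attacker_generate, target_respond, is_unsafe,
                  num_rounds=100):
    exemplars = list(seed_prompts)   # in-context examples shown to the attacker
    successful_attacks = []
    for _ in range(num_rounds):
        # 1. Attacker proposes a new adversarial prompt conditioned on exemplars.
        attack = attacker_generate(exemplars)
        # 2. The model under test responds to the attack.
        response = target_respond(attack)
        # 3. A safety classifier provides the feedback signal.
        if is_unsafe(response):
            successful_attacks.append((attack, response))
            # 4. Feed successful attacks back as in-context exemplars.
            exemplars.append(attack)
            exemplars = exemplars[-8:]  # keep a bounded exemplar window
    return successful_attacks
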
arXiv Detail & Related papers (2023-08-08T14:03:08Z) - Learn What NOT to Learn: Towards Generative Safety in Chatbots [40.8106410437709]
We present a novel framework, named "LOT" (Learn NOT to), that employs a contrastive loss to enhance generalization by learning from both positive and negative training signals.
LOT reduces toxicity by up to four-fold while achieving four to six-fold higher rates of engagingness and fluency compared to baseline models.
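
LOT learns from both positive (safe) and negative (unsafe) training signals with a contrastive objective. One plausible, simplified form of such an objective is sketched below: token-level likelihood on safe responses combined with an unlikelihood term on unsafe ones; this is an assumption about the general shape of such a loss, not LOT's exact formulation.

# Simplified contrastive-style objective over positive and negative signals:
# maximize likelihood of safe tokens, push down the probability of unsafe
# tokens via an unlikelihood term. Shapes: logits (B, T, V), labels (B, T).
import torch
import torch.nn.functional as F

def safety_contrastive_loss(logits_pos, labels_pos, logits_neg, labels_neg, alpha=1.0):
    # Standard token-level cross-entropy on the safe (positive) responses.
    ce_pos = F.cross_entropy(logits_pos.view(-1, logits_pos.size(-1)),
                             labels_pos.view(-1))
    # Unlikelihood term on tokens from unsafe (negative) responses.
    log_probs_neg = F.log_softmax(logits_neg, dim=-1)
    p_neg = log_probs_neg.gather(-1, labels_neg.unsqueeze(-1)).squeeze(-1).exp()
    unlikelihood = -torch.log(torch.clamp(1.0 - p_neg, min=1e-6)).mean()
    return ce_pos + alpha * unlikelihood
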
arXiv Detail & Related papers (2023-04-21T18:59:06Z) - ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and
Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z) - RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language
Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
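
The evaluation described amounts to sampling many continuations per prompt and scoring their toxicity. A minimal sketch of that loop, reporting the expected maximum toxicity over k samples, is shown below; generate_continuation and toxicity_score are hypothetical stand-ins for the language model and the toxicity classifier (the paper relies on Perspective API scores).

# Minimal sketch of prompted-toxicity evaluation: sample k continuations per
# prompt, score each, and report the expected maximum toxicity per prompt.
# generate_continuation and toxicity_score are hypothetical stand-ins.
def evaluate_toxic_degeneration(prompts, generate_continuation, toxicity_score, k=25):
    max_toxicities = []
    for prompt in prompts:
        scores = [toxicity_score(generate_continuation(prompt)) for _ in range(k)]
        max_toxicities.append(max(scores))
    # Expected maximum toxicity over k samples, averaged across prompts.
    return sum(max_toxicities) / len(max_toxicities)
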
arXiv Detail & Related papers (2020-09-24T03:17:19Z)