ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
- URL: http://arxiv.org/abs/2312.11523v1
- Date: Wed, 13 Dec 2023 08:25:07 GMT
- Title: ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
- Authors: Xinpeng Wang, Xiaoyuan Yi, Han Jiang, Shanlin Zhou, Zhihua Wei, Xing Xie
- Abstract summary: Recent large-scale Visual-Language Generative Models (VLGMs) have achieved unprecedented improvement in multimodal image/text generation.
These models might also generate toxic content, e.g., offensive text and pornographic images, raising significant ethical risks.
This work delves into the propensity for toxicity generation and susceptibility to toxic data across various VLGMs.
- Score: 36.60526586838288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Warning: this paper includes model outputs showing offensive content. Recent
large-scale Visual-Language Generative Models (VLGMs) have achieved
unprecedented improvement in multimodal image/text generation. However, these
models might also generate toxic content, e.g., offensive text and pornographic
images, raising significant ethical risks. Despite extensive studies on the toxic
degeneration of language models, this problem remains largely unexplored within
the context of visual-language generation. This work delves into the propensity
for toxicity generation and susceptibility to toxic data across various VLGMs.
For this purpose, we built ToViLaG, a dataset comprising 32K
co-toxic/mono-toxic text-image pairs and 1K innocuous but evocative texts that
tend to stimulate toxicity. Furthermore, we propose WInToRe, a novel toxicity
metric tailored to visual-language generation, which theoretically reflects
different aspects of toxicity considering both input and output. On such a
basis, we benchmarked the toxicity of a diverse spectrum of VLGMs and
discovered that some models do more evil than expected while some are more
vulnerable to infection, underscoring the necessity of VLGM detoxification.
Therefore, we develop an innovative bottleneck-based detoxification method. Our
method could reduce toxicity while maintaining comparable generation quality,
providing a promising initial solution to this line of research.
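The WInToRe formula itself is defined in the paper rather than in this abstract, so the snippet below is only a minimal sketch of the kind of input-versus-output toxicity accounting described above. It covers the text side only, uses the off-the-shelf Detoxify classifier as a stand-in scorer, and the function name, threshold, and "amplification" statistic are illustrative assumptions, not the paper's definitions.

```python
from detoxify import Detoxify  # pip install detoxify


def input_output_toxicity(input_texts, output_texts, threshold=0.5):
    """Text-side toxicity bookkeeping for paired inputs/outputs (sketch only)."""
    clf = Detoxify("original")
    in_scores = clf.predict(list(input_texts))["toxicity"]
    out_scores = clf.predict(list(output_texts))["toxicity"]
    n = len(in_scores)
    return {
        "input_toxicity_rate": sum(s > threshold for s in in_scores) / n,
        "output_toxicity_rate": sum(s > threshold for s in out_scores) / n,
        # Outputs that turn out more toxic than the inputs that produced them
        # hint at a model "doing more evil"; toxic outputs triggered by toxic
        # inputs hint at susceptibility to infection.
        "amplification_rate": sum(o > i for i, o in zip(in_scores, out_scores)) / n,
    }
```

Image-side toxicity (e.g., pornographic images) would require a separate image detector and is omitted from this sketch.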
Related papers
- FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts [13.470734853274587]
Large language models (LLMs) are increasingly popular but are also prone to generating biased, toxic, or harmful language.
We create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts.
We evaluate 14 different models from four prevalent open-sourced families of LLMs against our dataset to assess their potential toxicity.
arXiv Detail & Related papers (2024-06-25T14:02:11Z)
- Mitigating Text Toxicity with Counterfactual Generation [0.3250512744763586]
Toxicity mitigation consists in rephrasing text in order to remove harmful meaning.
Current methods fail to detoxify text while preserving the initial non-toxic meaning.
This work is the first to bridge the gap between counterfactual generation and text detoxification.
arXiv Detail & Related papers (2024-05-16T09:52:21Z)
- Parameter-Efficient Detoxification with Contrastive Decoding [78.5124331048714]
We introduce Detoxification Generator (DETOXIGEN), an inference-time algorithm that steers the generation away from unwanted styles.
During the actual generation, we use the trained detoxifier to produce undesirable tokens for the generator to contrast against at each decoding step.
We find that it significantly outperforms previous approaches in detoxification metrics while not compromising on the generation quality.
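The summary above leaves the exact decoding rule to the paper, so the following is a generic contrastive-decoding sketch rather than the DETOXIGEN procedure itself: the detoxifier's logits for undesirable tokens are subtracted from the generator's at each step. The `alpha` weight, the greedy argmax, and the shared-vocabulary assumption are illustrative choices, not the paper's.

```python
import numpy as np


def contrastive_next_token(generator_logits: np.ndarray,
                           detoxifier_logits: np.ndarray,
                           alpha: float = 1.0) -> int:
    """Greedy contrastive step (sketch): tokens the detoxifier model favors
    are pushed down in the generator's scores. Assumes both models share one
    vocabulary so the two logit vectors align elementwise."""
    combined = generator_logits - alpha * detoxifier_logits
    return int(np.argmax(combined))
```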
arXiv Detail & Related papers (2024-01-13T01:46:20Z)
- Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding [75.06872859716049]
Large Language Models (LLMs) have demonstrated a powerful ability for text generation.
However, undesired behaviors such as toxicity or hallucinations can manifest.
We propose formalizing text generation as a future-constrained generation problem.
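Only the high-level framing is given above, so the sketch below illustrates one generic way to realize it: candidate continuations are re-ranked by the model's log-probability plus an estimate that a future constraint (here, staying non-toxic) will be satisfied. The `toxicity_score` callable and the `lam` weight are hypothetical stand-ins, not the paper's formulation.

```python
import math
from typing import Callable, List, Tuple


def rerank_by_future_constraint(candidates: List[Tuple[str, float]],
                                toxicity_score: Callable[[str], float],
                                lam: float = 1.0) -> List[Tuple[str, float]]:
    """Re-rank (continuation, log-probability) pairs by likelihood plus an
    estimate that the non-toxicity constraint will hold (sketch only)."""
    def combined(item: Tuple[str, float]) -> float:
        text, logp = item
        # Treat 1 - toxicity as a crude estimate of constraint satisfaction.
        return logp + lam * math.log(max(1e-6, 1.0 - toxicity_score(text)))
    return sorted(candidates, key=combined, reverse=True)
```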
arXiv Detail & Related papers (2023-12-11T06:35:33Z)
- Unveiling the Implicit Toxicity in Large Language Models [77.90933074675543]
The open-endedness of large language models (LLMs), combined with their impressive capabilities, may lead to new safety issues when they are exploited for malicious use.
We show that LLMs can generate diverse implicit toxic outputs that are exceptionally difficult to detect via simple zero-shot prompting.
We propose a reinforcement learning (RL) based attacking method to further induce the implicit toxicity in LLMs.
arXiv Detail & Related papers (2023-11-29T06:42:36Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that BAD+ can largely expose their safety problems.
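The paper relies on a purpose-trained reverse dialogue model, which the summary above does not detail, so the sketch below only gestures at the idea using a generic Hugging Face text-generation pipeline and a hand-written prompt; the model name (gpt2) and the prompt wording are assumptions, not the paper's setup.

```python
from transformers import pipeline

# Placeholder model; the paper trains a dedicated reverse dialogue model instead.
reverse_generator = pipeline("text-generation", model="gpt2")


def reverse_generate_context(response: str, max_new_tokens: int = 40) -> str:
    """Sample a plausible preceding user message for a given bot reply (sketch)."""
    prompt = f"Bot reply: {response}\nA user message that could provoke this reply:"
    full = reverse_generator(prompt, max_new_tokens=max_new_tokens,
                             do_sample=True)[0]["generated_text"]
    return full[len(prompt):].strip()
```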
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection [33.715318646717385]
ToxiGen is a large-scale dataset of 274k toxic and benign statements about 13 minority groups.
Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale.
We find that 94.5% of toxic examples are labeled as hate speech by human annotators.
arXiv Detail & Related papers (2022-03-17T17:57:56Z)
- Leashing the Inner Demons: Self-Detoxification for Language Models [13.576289320208511]
Language models (LMs) can reproduce (or amplify) toxic language seen during training.
We analyze the impact of prompts, decoding strategies and training corpora on the output.
We propose a simple yet effective method for language models to "detoxify" themselves without an additional large corpus or external discriminator.
arXiv Detail & Related papers (2022-03-06T23:55:12Z)
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models [93.151822563361]
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment.
We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration.
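As a rough illustration of this kind of prompted evaluation, the sketch below samples several continuations per prompt and averages the per-prompt maximum toxicity; Detoxify, the sample count `k`, and the helper signature are stand-ins for the paper's actual scorer and protocol, not a reproduction of them.

```python
from detoxify import Detoxify


def expected_max_toxicity(prompts, generate, k=25):
    """Average, over prompts, of the maximum toxicity among k sampled
    continuations; `generate(prompt)` is any callable returning one string."""
    clf = Detoxify("original")
    per_prompt_max = []
    for prompt in prompts:
        continuations = [generate(prompt) for _ in range(k)]
        scores = clf.predict(continuations)["toxicity"]
        per_prompt_max.append(max(scores))
    return sum(per_prompt_max) / len(per_prompt_max)
```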
arXiv Detail & Related papers (2020-09-24T03:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.