Safety and Fairness for Content Moderation in Generative Models
- URL: http://arxiv.org/abs/2306.06135v1
- Date: Fri, 9 Jun 2023 01:37:32 GMT
- Title: Safety and Fairness for Content Moderation in Generative Models
- Authors: Susan Hao, Piyush Kumar, Sarah Laszlo, Shivani Poddar, Bhaktipriya
Radharapu, Renee Shelby
- Abstract summary: We provide a theoretical framework for conceptualizing responsible content moderation of text-to-image generative technologies.
We define and distinguish the concepts of safety, fairness, and metric equity, and enumerate example harms that can come in each domain.
We conclude with a summary of how the style of harms quantification we demonstrate enables data-driven content moderation decisions.
- Score: 0.7992463811844456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With significant advances in generative AI, new technologies are rapidly
being deployed with generative components. Generative models are typically
trained on large datasets, resulting in model behaviors that can mimic the
worst of the content in the training data. Responsible deployment of generative
technologies requires content moderation strategies, such as safety input and
output filters. Here, we provide a theoretical framework for conceptualizing
responsible content moderation of text-to-image generative technologies,
including a demonstration of how to empirically measure the constructs we
enumerate. We define and distinguish the concepts of safety, fairness, and
metric equity, and enumerate example harms that can come in each domain. We
then provide a demonstration of how the defined harms can be quantified. We
conclude with a summary of how the style of harms quantification we demonstrate
enables data-driven content moderation decisions.
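As a concrete illustration of the safety-input-filter pattern the abstract refers to, the sketch below thresholds per-category harm scores for a prompt and blocks the request if any category exceeds its limit. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, harm categories, scorer, and thresholds are hypothetical placeholders.

```python
# Minimal sketch of an input safety filter for a text-to-image system.
# The scorer, category names, and thresholds are hypothetical placeholders;
# an output filter would apply the same thresholding to a classifier run on
# the generated image instead of the prompt.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModerationDecision:
    allowed: bool
    scores: Dict[str, float]  # per-category harm scores in [0, 1]


def moderate_prompt(prompt: str,
                    score_fn: Callable[[str], Dict[str, float]],
                    thresholds: Dict[str, float]) -> ModerationDecision:
    """Block the request if any harm score exceeds its category threshold."""
    scores = score_fn(prompt)
    allowed = all(scores.get(cat, 0.0) <= thr for cat, thr in thresholds.items())
    return ModerationDecision(allowed=allowed, scores=scores)


# Stand-in scorer; a real deployment would call a trained safety classifier here,
# and empirical harm measurements like those in the paper could inform thresholds.
def dummy_scorer(prompt: str) -> Dict[str, float]:
    return {"violence": 0.05, "hate": 0.02, "sexual": 0.01}


decision = moderate_prompt("a watercolor painting of a lighthouse",
                           dummy_scorer,
                           thresholds={"violence": 0.5, "hate": 0.3, "sexual": 0.3})
print(decision.allowed, decision.scores)
```

The thresholding step is where quantified harm measurements would feed in: categories with higher measured harm rates can be given stricter limits, which is the sense in which harms quantification enables data-driven moderation decisions.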
Related papers
- Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models [24.851041038347784]
This characterization allows us to use our framework to audit models and prompt-datasets.
We implement Concept2Concept as an open-source interactive visualization tool facilitating use by non-technical end-users.
arXiv Detail & Related papers (2024-10-06T21:42:53Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion [51.931083971448885]
We propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images.
Our experimental results demonstrate our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
arXiv Detail & Related papers (2024-07-17T05:21:41Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
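As a rough illustration of this kind of check (not the authors' released code), the sketch below assumes a prompt embedding from the T2I text encoder, a learned projection into the safety latent space, and a set of harmful-concept embeddings are already available; the function name, projection head, and threshold are hypothetical.

```python
# Sketch of a latent-space concept check: embed the prompt with the T2I text
# encoder, map it into a learned safety latent space, and flag it if it lies
# too close to any harmful-concept embedding. All shapes, modules, and the
# threshold below are assumptions for illustration.
import torch
import torch.nn.functional as F


def prompt_is_flagged(prompt_emb: torch.Tensor,     # (d,) text-encoder embedding of the prompt
                      concept_embs: torch.Tensor,   # (k, d) embeddings of harmful concepts
                      projection: torch.nn.Module,  # learned map into the safety latent space
                      threshold: float = 0.75) -> bool:
    z_prompt = F.normalize(projection(prompt_emb), dim=-1)
    z_concepts = F.normalize(projection(concept_embs), dim=-1)
    similarity = z_concepts @ z_prompt                # cosine similarity to each concept
    return bool((similarity > threshold).any())
```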
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- Harm Amplification in Text-to-Image Models [5.397559484007124]
Text-to-image (T2I) models have emerged as a significant advancement in generative AI.
There exist safety concerns regarding their potential to produce harmful image outputs even when users input seemingly safe prompts.
This phenomenon, where T2I models generate harmful representations that were not explicit in the input prompt, poses a potentially greater risk than adversarial prompts.
arXiv Detail & Related papers (2024-02-01T23:12:57Z)
- A Holistic Approach to Undesired Content Detection in the Real World [4.626056557184189]
We present a holistic approach to building a robust natural language classification system for real-world content moderation.
The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions.
Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment.
arXiv Detail & Related papers (2022-08-05T16:47:23Z)
- A Hazard Analysis Framework for Code Synthesis Large Language Models [2.535935501467612]
Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code.
This paper outlines a hazard analysis framework constructed at OpenAI to uncover hazards or safety risks that the deployment of models like Codex may impose technically, socially, politically, and economically.
arXiv Detail & Related papers (2022-07-25T20:44:40Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality, and efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.