Safety and Fairness for Content Moderation in Generative Models
- URL: http://arxiv.org/abs/2306.06135v1
- Date: Fri, 9 Jun 2023 01:37:32 GMT
- Title: Safety and Fairness for Content Moderation in Generative Models
- Authors: Susan Hao, Piyush Kumar, Sarah Laszlo, Shivani Poddar, Bhaktipriya
Radharapu, Renee Shelby
- Abstract summary: We provide a theoretical framework for conceptualizing responsible content moderation of text-to-image generative technologies.
We define and distinguish the concepts of safety, fairness, and metric equity, and enumerate example harms that can come in each domain.
We conclude with a summary of how the style of harms quantification we demonstrate enables data-driven content moderation decisions.
- Score: 0.7992463811844456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With significant advances in generative AI, new technologies are rapidly
being deployed with generative components. Generative models are typically
trained on large datasets, resulting in model behaviors that can mimic the
worst of the content in the training data. Responsible deployment of generative
technologies requires content moderation strategies, such as safety input and
output filters. Here, we provide a theoretical framework for conceptualizing
responsible content moderation of text-to-image generative technologies,
including a demonstration of how to empirically measure the constructs we
enumerate. We define and distinguish the concepts of safety, fairness, and
metric equity, and enumerate example harms that can come in each domain. We
then provide a demonstration of how the defined harms can be quantified. We
conclude with a summary of how the style of harms quantification we demonstrate
enables data-driven content moderation decisions.
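As a concrete illustration of the safety-input-filter pattern the abstract refers to, the sketch below thresholds per-category harm scores for a prompt and blocks the request if any category exceeds its limit. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, harm categories, scorer, and thresholds are hypothetical placeholders.

```python
# Minimal sketch of an input safety filter for a text-to-image system.
# The scorer, category names, and thresholds are hypothetical placeholders;
# an output filter would apply the same thresholding to a classifier run on
# the generated image instead of the prompt.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModerationDecision:
    allowed: bool
    scores: Dict[str, float]  # per-category harm scores in [0, 1]


def moderate_prompt(prompt: str,
                    score_fn: Callable[[str], Dict[str, float]],
                    thresholds: Dict[str, float]) -> ModerationDecision:
    """Block the request if any harm score exceeds its category threshold."""
    scores = score_fn(prompt)
    allowed = all(scores.get(cat, 0.0) <= thr for cat, thr in thresholds.items())
    return ModerationDecision(allowed=allowed, scores=scores)


# Stand-in scorer; a real deployment would call a trained safety classifier here,
# and empirical harm measurements like those in the paper could inform thresholds.
def dummy_scorer(prompt: str) -> Dict[str, float]:
    return {"violence": 0.05, "hate": 0.02, "sexual": 0.01}


decision = moderate_prompt("a watercolor painting of a lighthouse",
                           dummy_scorer,
                           thresholds={"violence": 0.5, "hate": 0.3, "sexual": 0.3})
print(decision.allowed, decision.scores)
```

The thresholding step is where quantified harm measurements would feed in: categories with higher measured harm rates can be given stricter limits, which is the sense in which harms quantification enables data-driven moderation decisions.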
Related papers
- Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models [24.851041038347784]
This characterization allows us to use our framework to audit models and prompt-datasets.
We implement Concept2Concept as an open-source interactive visualization tool facilitating use by non-technical end-users.
arXiv Detail & Related papers (2024-10-06T21:42:53Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- Safeguard Text-to-Image Diffusion Models with Human Feedback Inversion [51.931083971448885]
We propose a framework named Human Feedback Inversion (HFI), where human feedback on model-generated images is condensed into textual tokens guiding the mitigation or removal of problematic images.
Our experimental results demonstrate our framework significantly reduces objectionable content generation while preserving image quality, contributing to the ethical deployment of AI in the public sphere.
arXiv Detail & Related papers (2024-07-17T05:21:41Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by integrating external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use the compositional property of diffusion models, which allows multiple prompts to be leveraged in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
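As a rough illustration of this kind of check (not the authors' released code), the sketch below assumes a prompt embedding from the T2I text encoder, a learned projection into the safety latent space, and a set of harmful-concept embeddings are already available; the function name, projection head, and threshold are hypothetical.

```python
# Sketch of a latent-space concept check: embed the prompt with the T2I text
# encoder, map it into a learned safety latent space, and flag it if it lies
# too close to any harmful-concept embedding. All shapes, modules, and the
# threshold below are assumptions for illustration.
import torch
import torch.nn.functional as F


def prompt_is_flagged(prompt_emb: torch.Tensor,     # (d,) text-encoder embedding of the prompt
                      concept_embs: torch.Tensor,   # (k, d) embeddings of harmful concepts
                      projection: torch.nn.Module,  # learned map into the safety latent space
                      threshold: float = 0.75) -> bool:
    z_prompt = F.normalize(projection(prompt_emb), dim=-1)
    z_concepts = F.normalize(projection(concept_embs), dim=-1)
    similarity = z_concepts @ z_prompt                # cosine similarity to each concept
    return bool((similarity > threshold).any())
```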
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- Harm Amplification in Text-to-Image Models [5.397559484007124]
Text-to-image (T2I) models have emerged as a significant advancement in generative AI.
There exist safety concerns regarding their potential to produce harmful image outputs even when users input seemingly safe prompts.
This phenomenon, where T2I models generate harmful representations that were not explicit in the input prompt, poses a potentially greater risk than adversarial prompts.
arXiv Detail & Related papers (2024-02-01T23:12:57Z)
- A Holistic Approach to Undesired Content Detection in the Real World [4.626056557184189]
We present a holistic approach to building a robust natural language classification system for real-world content moderation.
The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions.
Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment.
arXiv Detail & Related papers (2022-08-05T16:47:23Z)
- A Hazard Analysis Framework for Code Synthesis Large Language Models [2.535935501467612]
Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code.
This paper outlines a hazard analysis framework constructed at OpenAI to uncover hazards or safety risks that the deployment of models like Codex may impose technically, socially, politically, and economically.
arXiv Detail & Related papers (2022-07-25T20:44:40Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality, and efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.