Moderator: Moderating Text-to-Image Diffusion Models through Fine-grained Context-based Policies
- URL: http://arxiv.org/abs/2408.07728v2
- Date: Thu, 12 Sep 2024 02:39:27 GMT
- Title: Moderator: Moderating Text-to-Image Diffusion Models through Fine-grained Context-based Policies
- Authors: Peiran Wang, Qiyu Li, Longxuan Yu, Ziyao Wang, Ang Li, Haojian Jin
- Abstract summary: We present Moderator, a policy-based model management system that allows administrators to specify fine-grained content moderation policies.
We show that Moderator prevents 65% of users from generating moderated content within 15 attempts and forces the remaining users to spend an average of 8.3 times more attempts to generate undesired content.
- Score: 11.085388940369851
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Moderator, a policy-based model management system that allows administrators to specify fine-grained content moderation policies and modify the weights of a text-to-image (TTI) model to make it significantly more challenging for users to produce images that violate the policies. In contrast to existing general-purpose model editing techniques, which unlearn concepts without considering the associated contexts, Moderator allows admins to specify what content should be moderated, under which context, how it should be moderated, and why moderation is necessary. Given a set of policies, Moderator first prompts the original model to generate images that need to be moderated, then uses these self-generated images to reverse-fine-tune the model and compute task vectors for moderation, and finally subtracts the task vectors from the original model's weights to degrade its ability to generate the moderated content. We evaluated Moderator with 14 participants playing the role of admins and found that they could quickly learn and author policies that pass unit tests, in approximately 2.29 policy iterations. Our experiment with 32 Stable Diffusion users suggested that Moderator prevents 65% of users from generating moderated content within 15 attempts and forces the remaining users to spend an average of 8.3 times more attempts to generate undesired content.
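As a rough illustration (not the authors' code): the negation step described in the abstract matches standard task arithmetic, where the task vector is the weight difference between the reverse-fine-tuned model and the original, and subtracting a scaled copy suppresses the moderated capability. The function and parameter names below (`negate_task_vector`, `alpha`) are assumptions for illustration.

```python
import torch

def negate_task_vector(base_state, finetuned_state, alpha=1.0):
    """Suppress a moderated capability via task-vector negation.

    base_state / finetuned_state: state dicts of torch tensors, keyed by
    parameter name. The task vector is the element-wise difference between
    the weights fine-tuned on self-generated moderated images and the
    original weights; subtracting a scaled copy from the original model
    degrades its ability to produce that content.
    """
    moderated = {}
    for name, base_w in base_state.items():
        task_vector = finetuned_state[name] - base_w    # direction of the moderated "skill"
        moderated[name] = base_w - alpha * task_vector  # negate it
    return moderated

# Usage (hypothetical):
# model.load_state_dict(
#     negate_task_vector(model.state_dict(), finetuned.state_dict(), alpha=0.8))
```

Under this framing, fine-tuning on the self-generated moderated images isolates the capability to be removed, and the scaling factor `alpha` trades off moderation strength against collateral damage to unrelated generations.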
Related papers
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts.
The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts.
In response, concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress by leveraging advanced large vision-language (VL) models for the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and relies only on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- CCM: Adding Conditional Controls to Text-to-Image Consistency Models [89.75377958996305]
We consider alternative strategies for adding ControlNet-like conditional control to Consistency Models.
A lightweight adapter can be jointly optimized under multiple conditions through Consistency Training.
We study these three solutions across various conditional controls, including edge, depth, human pose, low-resolution image and masked image.
arXiv Detail & Related papers (2023-12-12T04:16:03Z)
- Toxicity Detection is NOT all you Need: Measuring the Gaps to Supporting Volunteer Content Moderators [19.401873797111662]
We conduct a model review on Hugging Face to assess the availability of models covering various moderation rules and guidelines.
We put state-of-the-art LLMs to the test, evaluating how well these models perform in flagging violations of platform rules from one particular forum.
Overall, we observe a non-trivial gap: developed models are missing for many rules, and LLMs exhibit moderate to low performance on a significant portion of them.
arXiv Detail & Related papers (2023-11-14T03:18:28Z)
- DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models [79.71665540122498]
We propose a method for detecting unauthorized data usage by planting injected content in the protected dataset.
Specifically, we modify the protected images by adding unique content to them using stealthy image warping functions.
By analyzing whether a model has memorized the injected content, we can detect models that were illegally trained on the unauthorized data (a toy version of the injection step is sketched after this list).
arXiv Detail & Related papers (2023-07-06T16:27:39Z)
- Towards Intersectional Moderation: An Alternative Model of Moderation Built on Care and Power [0.4351216340655199]
I perform a collaborative ethnography with moderators of r/AskHistorians, a community that uses an alternative moderation model.
I focus on three emblematic controversies of r/AskHistorians' alternative model of moderation.
I argue that designers should support decision-making processes and policy makers should account for the impact of sociotechnical systems.
arXiv Detail & Related papers (2023-05-18T18:27:52Z)
- Multilingual Content Moderation: A Case Study on Reddit [23.949429463013796]
We propose to study the challenges of content moderation by introducing a multilingual dataset of 1.8 million Reddit comments.
We perform extensive experimental analysis to highlight the underlying challenges and suggest related research problems.
Our dataset and analysis can help better prepare for the challenges and opportunities of auto moderation.
arXiv Detail & Related papers (2023-02-19T16:36:33Z)
- Benchmarking Robustness to Adversarial Image Obfuscations [22.784762155781436]
Malicious actors may obfuscate policy-violating images to prevent machine learning models from reaching the correct decision.
This benchmark, based on ImageNet, simulates the type of obfuscations created by malicious actors.
arXiv Detail & Related papers (2023-01-30T15:36:44Z)
- Explainable Abuse Detection as Intent Classification and Slot Filling [66.80201541759409]
We introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone.
We show how architectures for intent classification and slot filling can be used for abuse detection while providing a rationale for model decisions (a minimal two-head model in this style is sketched after this list).
arXiv Detail & Related papers (2022-10-06T03:33:30Z)
- Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild [7.176020195419459]
Social media platforms struggle to protect users from harmful content through content moderation.
These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content produced daily.
Third-party content moderation services provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons.
We introduce a simple yet effective threshold optimization method that searches for the optimal thresholds of the multiple subtasks to make reliable moderation decisions in a cost-effective way (see the threshold-search sketch after this list).
arXiv Detail & Related papers (2022-08-16T03:51:43Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
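For the DIAGNOSIS entry above, here is a toy version of the injection step, using a fixed sinusoidal warp as a stand-in for the paper's stealthy image warping functions; all names and parameters are illustrative assumptions.

```python
import numpy as np

def plant_warp_signature(image, strength=0.002, freq=4.0, seed=0):
    """Apply a subtle, fixed sinusoidal warp to an image (H x W x C, float in [0, 1]).

    Every protected image carries the same warp (fixed seed), so a model
    trained on the protected set tends to memorize it; generations can later
    be tested for the signature to flag unauthorized training. This is an
    illustrative stand-in, not the paper's warping function.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    phase = rng.uniform(0, 2 * np.pi)
    dx = strength * w * np.sin(2 * np.pi * freq * ys / h + phase)  # horizontal shift varies by row
    dy = strength * h * np.sin(2 * np.pi * freq * xs / w + phase)  # vertical shift varies by column
    src_y = np.clip(np.rint(ys + dy).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + dx).astype(int), 0, w - 1)
    return image[src_y, src_x]  # resample at the displaced coordinates
```

Detection would then, roughly, train a binary classifier to distinguish warped from clean images and apply it to the suspect model's generations; frequent positives suggest the model memorized the signature.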
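For the policy-aware abuse detection entry above, a minimal sketch of the general intent-plus-slots architecture it describes; the encoder choice, label sets, and dimensions are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class PolicyAwareAbuseModel(nn.Module):
    """Joint intent classification and slot filling for abuse detection.

    A shared encoder feeds two heads: a post-level intent head that maps the
    text to a policy category (e.g., "harassment", "spam", "none"), and a
    token-level slot head that tags the spans justifying that decision,
    serving as a rationale. Label sets and sizes are illustrative.
    """
    def __init__(self, vocab_size=30000, dim=256, n_intents=5, n_slots=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, n_intents)  # one label per post
        self.slot_head = nn.Linear(2 * dim, n_slots)      # one label per token

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))    # (B, T, 2*dim)
        intent_logits = self.intent_head(hidden.mean(dim=1))
        slot_logits = self.slot_head(hidden)               # rationale spans
        return intent_logits, slot_logits

# Usage (illustrative):
# model = PolicyAwareAbuseModel()
# intent_logits, slot_logits = model(torch.randint(0, 30000, (8, 64)))
```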
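For the threshold-optimization entry above, a minimal sketch of searching per-subtask thresholds on validation data; the OR-aggregation rule, F1 objective, and exhaustive grid are assumptions for illustration.

```python
import itertools
import numpy as np

def optimize_thresholds(scores, labels, grid=np.linspace(0.1, 0.9, 9)):
    """Grid-search one threshold per subtask (e.g., weapons, rude gestures).

    scores: (n_samples, n_subtasks) prediction scores from a moderation API.
    labels: (n_samples,) 1 if the item should be moderated, else 0.
    An item is flagged if ANY subtask score crosses its threshold; we keep
    the threshold vector with the best F1 on the validation data.
    """
    best_f1, best_thresholds = -1.0, None
    for thresholds in itertools.product(grid, repeat=scores.shape[1]):
        flags = (scores >= np.asarray(thresholds)).any(axis=1)
        tp = np.sum(flags & (labels == 1))
        fp = np.sum(flags & (labels == 0))
        fn = np.sum(~flags & (labels == 1))
        f1 = 2 * tp / max(2 * tp + fp + fn, 1)  # guard against division by zero
        if f1 > best_f1:
            best_f1, best_thresholds = f1, thresholds
    return best_thresholds, best_f1
```

Exhaustive search is exponential in the number of subtasks; a coordinate-wise or randomized search scales better, but the idea is the same.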