Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
- URL: http://arxiv.org/abs/2507.22617v1
- Date: Wed, 30 Jul 2025 12:37:29 GMT
- Title: Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
- Authors: Yiting Qu, Ziqing Yang, Yihan Ma, Michael Backes, Savvas Zannettou, Yang Zhang
- Abstract summary: We investigate the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. We generate 1,860 optical illusions using Stable Diffusion and ControlNet conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset.
- Score: 26.051334752537546
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions--visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address this risk, we explore preliminary mitigation measures and identify the most effective approaches from the perspectives of image transformations and training-level strategies.
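The abstract reports that the most effective mitigations include image transformations. As an illustration only (not the paper's exact procedure), the sketch below assumes the hidden message occupies the low-frequency layer of the illusion, so a strong low-pass transform (downscale plus blur) can expose it to a downstream moderation model; the transform parameters and the `classifier` hook are placeholders, not the authors' configuration.

```python
from PIL import Image, ImageFilter

def reveal_hidden_layer(path: str, low_res: int = 64) -> Image.Image:
    """Low-pass an illusion image so the embedded (second-layer) message dominates."""
    img = Image.open(path).convert("RGB")
    # Downscale aggressively, then upscale back: only coarse structure survives.
    coarse = img.resize((low_res, low_res), Image.BILINEAR).resize(img.size, Image.BILINEAR)
    # A mild blur removes residual high-frequency scene detail.
    return coarse.filter(ImageFilter.GaussianBlur(radius=4))

def moderate(path: str, classifier) -> bool:
    """Run a (placeholder) moderation classifier on the raw image and its low-pass view."""
    raw = Image.open(path).convert("RGB")
    views = [raw, reveal_hidden_layer(path)]
    # Flag the image if either the surface scene or the revealed layer is unsafe.
    return any(classifier(view) for view in views)
```

Checking both views targets the failure mode identified in the paper, where vision encoders attend mainly to surface-level detail and overlook the secondary layer of information.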
Related papers
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language model benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- SemVink: Advancing VLMs' Semantic Understanding of Optical Illusions via Visual Global Thinking [12.215295420714787]
Vision-language models (VLMs) excel at semantic tasks but falter at a core human capability: detecting hidden content in images. We introduce HC-Bench, a benchmark of 112 images with hidden text, objects, and illusions. We propose SemVink (Semantic Visual Thinking), which unlocks >99% accuracy by eliminating redundant visual noise.
arXiv Detail & Related papers (2025-06-03T12:33:47Z)
- IllusionBench+: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models [56.34742191010987]
Current Visual Language Models (VLMs) show impressive image understanding but struggle with visual illusions. We introduce IllusionBench, a comprehensive visual illusion dataset that encompasses classic cognitive illusions and real-world scene illusions. We design trap illusions that resemble classical patterns but differ in reality, highlighting issues in SOTA models.
arXiv Detail & Related papers (2025-01-01T14:10:25Z)
- Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector [97.92369017531038]
We build a new laRge-scale Adversarial images dataset with Diverse hArmful Responses (RADAR).
We then develop a novel iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of Visual Language Models (VLMs) to distinguish adversarial images from benign ones in the input.
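As a rough illustration of the single-vector idea (NEARSIDE's actual distillation procedure is not reproduced here), one plausible sketch fits a direction in the VLM's hidden-state space that separates adversarial from benign embeddings and thresholds the projection of a new input onto it; the mean-difference estimator and the threshold below are assumptions.

```python
import numpy as np

def fit_direction(benign: np.ndarray, adversarial: np.ndarray) -> np.ndarray:
    """benign, adversarial: (N, d) hidden-state embeddings taken from the same VLM layer."""
    # One simple "single vector": the normalized difference of class means.
    direction = adversarial.mean(axis=0) - benign.mean(axis=0)
    return direction / np.linalg.norm(direction)

def is_adversarial(embedding: np.ndarray, direction: np.ndarray, threshold: float) -> bool:
    """Score a new input by its projection onto the direction; threshold is calibrated on held-out data."""
    return float(embedding @ direction) > threshold
```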
arXiv Detail & Related papers (2024-10-30T10:33:10Z)
- Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. Concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
- Counterfactual Explanations for Face Forgery Detection via Adversarial Removal of Artifacts [23.279652897139286]
Highly realistic AI-generated face forgeries, known as deepfakes, have raised serious social concerns.
We provide counterfactual explanations for face forgery detection from an artifact removal perspective.
Our method achieves over 90% attack success rate and superior attack transferability.
arXiv Detail & Related papers (2024-04-12T09:13:37Z)
- Diffusion Illusions: Hiding Images in Plain Sight [37.87050866208039]
Diffusion Illusions is the first comprehensive pipeline designed to automatically generate a wide range of illusions.
We study three types of illusions, each arranging the prime images in a different way.
We conduct comprehensive experiments on these illusions and verify the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-12-06T18:59:18Z)
- MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning [59.988458964353754]
Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos.
Existing approaches perturb user images in an imperceptible way to render them "unlearnable" for malicious uses.
We propose MetaCloak, which solves the bi-level poisoning problem with a meta-learning framework.
arXiv Detail & Related papers (2023-11-22T03:31:31Z)
- Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition.
Despite this promise, unsupervised learning on diffusion-generated images remains underexplored.
We introduce customized solutions that fully exploit these free attention masks.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
- AdvDrop: Adversarial Attack to DNNs by Dropping Information [12.090562737098407]
We propose a novel adversarial attack, named AdvDrop, which crafts adversarial examples by dropping existing information from images.
We demonstrate the effectiveness of AdvDrop through extensive experiments and show that this new type of adversarial example is more difficult for current defense systems to defend against.
arXiv Detail & Related papers (2021-08-20T07:46:31Z)
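For context on the "dropping information" idea, a hedged sketch follows: it assumes the drop is realized as coarse quantization of DCT coefficients, which discards fine image detail; AdvDrop additionally learns where to drop information adversarially, which this sketch omits.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(x: np.ndarray) -> np.ndarray:
    # 2-D type-II DCT applied along rows and columns.
    return dct(dct(x, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(x: np.ndarray) -> np.ndarray:
    return idct(idct(x, axis=0, norm="ortho"), axis=1, norm="ortho")

def drop_information(gray: np.ndarray, q: float = 40.0) -> np.ndarray:
    """Quantize DCT coefficients of a grayscale image, discarding fine detail."""
    coeffs = dct2(gray.astype(np.float64))
    coeffs = np.round(coeffs / q) * q  # coarse quantization = information loss
    return np.clip(idct2(coeffs), 0, 255).astype(np.uint8)
```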