SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
- URL: http://arxiv.org/abs/2510.21120v1
- Date: Fri, 24 Oct 2025 03:19:48 GMT
- Title: SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
- Authors: Alec Helbling, Shruti Palaskar, Kundan Krishna, Polo Chau, Leon Gatys, Joseph Yitan Cheng
- Abstract summary: We introduce SafetyPairs, a framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories.
- Score: 5.313750874857107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to a given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models' abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.
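The abstract describes the pipeline at a high level: start from an image, apply a targeted edit that touches only the policy-relevant feature, and leave everything else unchanged. Below is a minimal sketch of that loop, assuming InstructPix2Pix (via Hugging Face diffusers) as the editing model; the model choice, file name, and edit instruction are illustrative stand-ins, since the abstract does not name the paper's actual editing model or prompts.

```python
# Hypothetical sketch of a SafetyPairs-style counterfactual edit, using
# InstructPix2Pix from Hugging Face diffusers as a stand-in editing model.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def make_safety_pair(image: Image.Image, edit_instruction: str) -> Image.Image:
    """Apply a targeted edit that changes only the policy-relevant feature,
    flipping the image's safety label while preserving everything else."""
    edited = pipe(
        edit_instruction,
        image=image,
        num_inference_steps=20,
        image_guidance_scale=1.5,  # higher values keep safety-irrelevant details intact
    ).images[0]
    return edited

# Example: turn a benign image into its unsafe counterfactual.
benign = Image.open("crowd_scene.png").convert("RGB")  # illustrative path
unsafe = make_safety_pair(benign, "make one person display an insulting gesture")
```

The `image_guidance_scale` knob is the natural lever here: it trades edit strength against fidelity to the original image, which is exactly the property a counterfactual pair needs.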
Related papers
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
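A hedged guess at what "prompt embedding redirection" could look like at inference time: detect prompts whose embedding is close to an unlearned concept and blend them toward a safe anchor. The threshold, blend rule, and all names below are assumptions; SafeRedir's actual mechanism may differ.

```python
# Illustrative sketch only: redirect prompt embeddings that match an
# unlearned concept toward a safe anchor embedding at inference time.
import torch
import torch.nn.functional as F

def redirect_embedding(prompt_emb: torch.Tensor,
                       concept_emb: torch.Tensor,
                       anchor_emb: torch.Tensor,
                       threshold: float = 0.7,
                       alpha: float = 0.9) -> torch.Tensor:
    """If the prompt is close to the forgotten concept, blend it toward
    the safe anchor; otherwise pass it through unchanged."""
    sim = F.cosine_similarity(prompt_emb.flatten(), concept_emb.flatten(), dim=0)
    if sim < threshold:          # unrelated prompt: leave untouched
        return prompt_emb
    return (1 - alpha) * prompt_emb + alpha * anchor_emb
```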
- SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation [21.845417608250035]
Diffusion-based T2I models have achieved remarkable image generation quality, but they also enable easy creation of harmful content. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask.
arXiv Detail & Related papers (2025-11-14T07:04:06Z)
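A rough sketch of what a "selective guidance mask" might mean inside one diffusion denoising step: apply an extra safe-guidance term only where the estimated harmfulness mask is active. The formulation and scales are assumptions based solely on the summary, not SP-Guard's actual method.

```python
# Sketch of selective safe guidance within a single denoising step.
import torch

def guided_noise(eps_uncond: torch.Tensor,
                 eps_cond: torch.Tensor,
                 eps_safe: torch.Tensor,
                 harm_mask: torch.Tensor,   # 1 where content is flagged harmful
                 cfg_scale: float = 7.5,
                 safe_scale: float = 4.0) -> torch.Tensor:
    """Standard classifier-free guidance, plus a safe-direction term that
    is applied only in regions flagged by the harmfulness mask."""
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    # Push flagged regions toward the safe-conditioned prediction only,
    # leaving benign regions untouched.
    return eps + harm_mask * safe_scale * (eps_safe - eps_cond)
```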
- Reimagining Safety Alignment with An Image [49.33281424100804]
Large language models (LLMs) excel in diverse applications but face dual challenges: generating harmful content under jailbreak attacks and over-refusal of benign queries. We propose Magic Image, an optimization-driven visual prompt framework that enhances security while reducing over-refusal.
arXiv Detail & Related papers (2025-11-01T11:27:07Z)
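A schematic, heavily hedged reading of "optimization-driven visual prompt": treat the image itself as the trainable parameter and descend on a combined objective penalizing both harmful compliance and over-refusal. Both loss terms below are runnable stand-ins, not the paper's objectives.

```python
# Schematic only: learn a single image that, when attached to multimodal
# inputs, lowers harmful compliance and over-refusal simultaneously.
import torch

def harmful_compliance_loss(img: torch.Tensor) -> torch.Tensor:
    return img.mean() ** 2          # stand-in for a model-based objective

def over_refusal_loss(img: torch.Tensor) -> torch.Tensor:
    return (1 - img).mean() ** 2    # stand-in for a model-based objective

magic_image = torch.rand(3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([magic_image], lr=1e-2)

for step in range(100):
    loss = harmful_compliance_loss(magic_image) + over_refusal_loss(magic_image)
    opt.zero_grad()
    loss.backward()
    opt.step()
    magic_image.data.clamp_(0, 1)   # keep pixel values in a valid range
```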
- SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing [13.35302137193851]
We propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. We develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images.
arXiv Detail & Related papers (2025-10-28T15:12:15Z)
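The multi-round editing paradigm reads as a judge-then-edit loop; here is a minimal sketch, with `judge` and `edit` as hypothetical interfaces standing in for the unified MLLM's two roles.

```python
# Hedged sketch of a multi-round post-hoc safety editing loop.
from typing import Callable, Tuple
from PIL import Image

def multi_round_edit(image: Image.Image,
                     judge: Callable[[Image.Image], Tuple[bool, str]],
                     edit: Callable[[Image.Image, str], Image.Image],
                     max_rounds: int = 3) -> Image.Image:
    """Repeatedly identify unsafe content and refine it until the image
    passes the judge or the round budget is exhausted."""
    for _ in range(max_rounds):
        is_safe, complaint = judge(image)   # e.g. (False, "visible weapon")
        if is_safe:
            break
        image = edit(image, f"remove the unsafe element: {complaint}")
    return image
```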
- SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability [49.074914896839466]
We introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We show that SafeVision achieves state-of-the-art performance on different benchmarks.
arXiv Detail & Related papers (2025-10-28T00:35:59Z)
- SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
- MLLM-as-a-Judge for Image Safety without Human Labeling [81.24707039432292]
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content. It is crucial to identify such unsafe images based on established safety rules. Existing approaches typically fine-tune MLLMs with human-labeled datasets.
arXiv Detail & Related papers (2024-12-31T00:06:04Z)
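A minimal sketch of rule-conditioned judging without human labels: hand the written safety rules and the image to an MLLM and parse a binary verdict. `query_mllm` is a hypothetical interface; the paper's actual prompting and calibration details are not reproduced here.

```python
# Sketch of judging image safety against explicit written rules with an MLLM.
from typing import Callable

RULES = [
    "No depictions of graphic violence.",
    "No insulting gestures or hate symbols.",
]

def judge_image(image_path: str, query_mllm: Callable[[str, str], str]) -> bool:
    """Return True if the MLLM judges the image safe under RULES."""
    prompt = (
        "You are an image safety judge. Safety rules:\n"
        + "\n".join(f"- {r}" for r in RULES)
        + "\nDoes the image violate any rule? Answer exactly SAFE or UNSAFE."
    )
    verdict = query_mllm(prompt, image_path)  # hypothetical MLLM interface
    return verdict.strip().upper() == "SAFE"
```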
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can expose users to harmful, unsafe, controversial, or culturally inappropriate outputs. We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
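Taken literally, "weighted tunable summation in the latent space" suggests a convex blend of the current latent with a safe-conditioned latent; below is a one-line sketch under that assumption (the real method's weighting schedule and source of the safe latent are not specified in the summary).

```python
# Sketch of a tunable weighted summation of diffusion latents.
import torch

def blend_latents(z: torch.Tensor, z_safe: torch.Tensor, w: float = 0.3) -> torch.Tensor:
    """w=0 keeps the original latent unchanged; w=1 is fully safe-conditioned."""
    return (1.0 - w) * z + w * z_safe
```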
- UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images [24.447395464275942]
We propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe. We then evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers powered by general-purpose visual language models.
arXiv Detail & Related papers (2024-05-06T13:57:03Z)
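A toy harness in the spirit of this benchmark: score each classifier's accuracy over a labeled pool of safe and unsafe images. The dataset layout and classifier interface below are illustrative only, not the benchmark's real harness.

```python
# Toy evaluation loop for image safety classifiers over labeled images.
from typing import Callable, List, Tuple
from PIL import Image

def evaluate(classifier: Callable[[Image.Image], bool],
             dataset: List[Tuple[str, bool]]) -> float:
    """dataset: (image_path, is_unsafe) ground-truth pairs.
    classifier: returns True if it flags the image as unsafe."""
    correct = 0
    for path, is_unsafe in dataset:
        pred = classifier(Image.open(path).convert("RGB"))
        correct += int(pred == is_unsafe)
    return correct / len(dataset)
```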
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.