SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation
- URL: http://arxiv.org/abs/2412.16039v2
- Date: Thu, 29 May 2025 12:01:59 GMT
- Title: SafeCFG: Controlling Harmful Features with Dynamic Safe Guidance for Safe Generation
- Authors: Jiadong Pan, Liang Li, Hongcheng Gao, Zheng-Jun Zha, Qingming Huang, Jiebo Luo
- Abstract summary: Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks. They can be used to generate more harmful images by maliciously guiding the image generation process through CFG. We propose SafeCFG to adaptively control harmful features with dynamic safe guidance.
- Score: 125.0706666755989
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models (DMs) have demonstrated exceptional performance in text-to-image tasks, leading to their widespread use. With the introduction of classifier-free guidance (CFG), the quality of images generated by DMs has improved significantly. However, DMs can be used to generate more harmful images by maliciously guiding the image generation process through CFG. Existing safe alignment methods aim to mitigate the risk of generating harmful images but often reduce the quality of clean image generation. To address this issue, we propose SafeCFG, which adaptively controls harmful features with dynamic safe guidance by modulating the CFG generation process. It dynamically guides CFG generation based on the harmfulness of the prompt, inducing significant deviations only in harmful generations and thereby achieving both high quality and safety. SafeCFG can simultaneously modulate different harmful CFG generation processes, so it can eliminate harmful elements while preserving high-quality generation. Additionally, SafeCFG can detect image harmfulness, allowing unsupervised safe alignment of DMs without pre-defined clean or harmful labels. Experimental results show that images generated with SafeCFG achieve both high quality and safety, and safe DMs trained in this unsupervised manner also exhibit good safety performance.
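The abstract describes the mechanism only at a high level: the guidance applied at sampling time is modulated by how harmful the prompt is judged to be, so that clean prompts keep ordinary CFG quality while harmful prompts are deviated. The sketch below is a minimal illustration of that idea, assuming a hypothetical noise predictor `eps_model`, a safe-concept embedding `safe_emb`, and an externally supplied `harm_score`; the linear blending rule is an assumption for illustration, not the authors' published update.

```python
# Hypothetical sketch of harmfulness-modulated classifier-free guidance.
# Not the authors' code: `eps_model`, `safe_emb`, and the blending rule are
# assumptions; the abstract only states that guidance is dynamically adjusted
# so that only harmful prompts are significantly deviated.
import torch


def safe_cfg_noise(eps_model, x_t, t, cond_emb, uncond_emb, safe_emb,
                   harm_score: float, guidance_scale: float = 7.5) -> torch.Tensor:
    """Return one denoising-step noise estimate with dynamic safe guidance.

    Standard CFG: eps = eps_uncond + w * (eps_cond - eps_uncond).
    The CFG result is blended toward a prediction under a safe embedding in
    proportion to harm_score in [0, 1] (0 = clean prompt, 1 = clearly harmful).
    """
    eps_uncond = eps_model(x_t, t, uncond_emb)  # unconditional prediction
    eps_cond = eps_model(x_t, t, cond_emb)      # prompt-conditioned prediction
    eps_safe = eps_model(x_t, t, safe_emb)      # prediction under the assumed safe embedding

    eps_cfg = eps_uncond + guidance_scale * (eps_cond - eps_uncond)  # vanilla CFG
    # Dynamic safe guidance: clean prompts keep plain CFG, harmful prompts are
    # steered toward the safe prediction.
    return (1.0 - harm_score) * eps_cfg + harm_score * eps_safe
```

Under this reading, harm_score ≈ 0 reduces to standard CFG and harm_score ≈ 1 steers generation toward the safe prediction, matching the abstract's claim that significant deviations are induced only for harmful generations; the abstract further notes that SafeCFG can itself estimate harmfulness, which is what enables unsupervised safe alignment without clean or harmful labels.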
Related papers
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z) - SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation [21.845417608250035]
Diffusion-based T2I models have achieved remarkable image generation quality, but they also enable easy creation of harmful content. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask.
arXiv Detail & Related papers (2025-11-14T07:04:06Z) - SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content. We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality. SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z) - UpSafe$^\circ$C: Upcycling for Controllable Safety in Large Language Models [67.91151588917396]
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. We propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
arXiv Detail & Related papers (2025-10-02T16:43:33Z) - GIFT: Gradient-aware Immunization of diffusion models against malicious Fine-Tuning with safe concepts retention [5.429335132446078]
GIFT is a Gradient-aware Immunization technique that defends diffusion models against malicious fine-tuning while retaining safe concepts.
arXiv Detail & Related papers (2025-07-18T01:47:07Z) - Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks. We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
arXiv Detail & Related papers (2025-05-22T18:05:16Z) - Towards NSFW-Free Text-to-Image Generation via Safety-Constraint Direct Preference Optimization [30.31991120463517]
Existing studies fail to guarantee complete safety under potentially harmful concepts or struggle to balance safety with generation quality. We propose Safety-Constrained Direct Preference Optimization (SC-DPO), a novel framework for safety alignment in T2I models. SC-DPO integrates safety constraints into the general human preference calibration, aiming to maximize the likelihood of generating human-preferred samples.
arXiv Detail & Related papers (2025-04-19T13:26:46Z) - Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization [22.225141381422873]
There is a growing concern about text-to-image diffusion models creating harmful content.
Post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed to mitigate these risks.
We propose the safe generation framework Detect-and-Guide (DAG) to perform self-diagnosis and fine-grained self-regulation.
DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on real-world prompts.
arXiv Detail & Related papers (2025-03-19T13:37:52Z) - SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals [50.463399903987245]
Large language models (LLMs) exhibit exceptional capabilities across various tasks but also pose risks by generating harmful content. We show that LLMs can similarly perform internal assessments about safety in their internal states. We propose SafeSwitch, a framework that regulates unsafe outputs by utilizing the prober-based internal state monitor.
arXiv Detail & Related papers (2025-02-03T04:23:33Z) - MLLM-as-a-Judge for Image Safety without Human Labeling [81.24707039432292]
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content. It is crucial to identify such unsafe images based on established safety rules. Existing approaches typically fine-tune MLLMs with human-labeled datasets.
arXiv Detail & Related papers (2024-12-31T00:06:04Z) - Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
arXiv Detail & Related papers (2024-11-30T04:37:38Z) - Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [88.18235230849554]
Training multimodal generative models on large, uncurated datasets can expose users to harmful, unsafe, controversial, or culturally inappropriate outputs.
We leverage safe embeddings and a modified diffusion process with weighted tunable summation in the latent space to generate safer images.
We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models.
arXiv Detail & Related papers (2024-11-21T09:47:13Z) - ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning [7.099258248662009]
There is a potential risk that text-to-image (T2I) models can generate unsafe images with uncomfortable content.
In our work, we focus on eliminating NSFW (not safe for work) content generation from T2I models.
We propose a customized reward function consisting of the CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune nudity content.
arXiv Detail & Related papers (2024-10-04T19:37:56Z) - Plug-and-Hide: Provable and Adjustable Diffusion Generative Steganography [40.357567971092564]
Generative Steganography (GS) is a technique that utilizes generative models to conceal messages without relying on cover images.
GS algorithms leverage the powerful generative capabilities of Diffusion Models (DMs) to create high-fidelity stego images.
In this paper, we rethink the trade-off among image quality, steganographic security, and message extraction accuracy within Diffusion Generative Steganography (DGS) settings.
arXiv Detail & Related papers (2024-09-07T18:06:47Z) - C-RAG: Certified Generation Risks for Retrieval-Augmented Language Models [57.10361282229501]
We propose C-RAG, the first framework to certify generation risks for RAG models.
Specifically, we provide conformal risk analysis for RAG models and certify an upper confidence bound of generation risks.
We prove that RAG achieves a lower conformal generation risk than that of a single LLM when the quality of the retrieval model and transformer is non-trivial.
arXiv Detail & Related papers (2024-02-05T16:46:16Z) - MITS-GAN: Safeguarding Medical Imaging from Tampering with Generative Adversarial Networks [48.686454485328895]
This study introduces MITS-GAN, a novel approach to prevent tampering in medical images.
The approach disrupts the output of the attacker's CT-GAN architecture by introducing finely tuned perturbations that are imperceptible to the human eye.
Experimental results on a CT scan demonstrate MITS-GAN's superior performance.
arXiv Detail & Related papers (2024-01-17T22:30:41Z) - Attribute-Guided Encryption with Facial Texture Masking [64.77548539959501]
We propose Attribute Guided Encryption with Facial Texture Masking to protect users from unauthorized facial recognition systems.
Our proposed method produces more natural-looking encrypted images than state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T23:50:43Z) - RGI: robust GAN-inversion for mask-free image inpainting and unsupervised pixel-wise anomaly detection [18.10039647382319]
We propose a Robust GAN-inversion (RGI) method with a provable robustness guarantee to achieve image restoration under unknown gross corruptions.
We show that the restored image and the identified corrupted region mask converge to the ground truth.
The proposed RGI/R-RGI method unifies two important applications with state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2023-02-24T05:43:03Z) - Guided Diffusion Model for Adversarial Purification [103.4596751105955]
Adversarial attacks disturb deep neural networks (DNNs) in various algorithms and frameworks.
We propose a novel purification approach, referred to as guided diffusion model for purification (GDMP).
On our comprehensive experiments across various datasets, the proposed GDMP is shown to reduce the perturbations raised by adversarial attacks to a shallow range.
arXiv Detail & Related papers (2022-05-30T10:11:15Z) - CUDA-GR: Controllable Unsupervised Domain Adaptation for Gaze Redirection [3.0141238193080295]
The aim of gaze redirection is to manipulate the gaze in an image to the desired direction.
Advancement in generative adversarial networks has shown excellent results in generating photo-realistic images.
To enable such fine-tuned control, one needs to obtain ground-truth annotations for the training data, which can be very expensive.
arXiv Detail & Related papers (2021-06-21T04:39:42Z) - Blur, Noise, and Compression Robust Generative Adversarial Networks [85.68632778835253]
We propose blur, noise, and compression robust GAN (BNCR-GAN) to learn a clean image generator directly from degraded images.
Inspired by NR-GAN, BNCR-GAN uses a multiple-generator model composed of image, blur-kernel, noise, and quality-factor generators.
We demonstrate the effectiveness of BNCR-GAN through large-scale comparative studies on CIFAR-10 and a generality analysis on FFHQ.
arXiv Detail & Related papers (2020-03-17T17:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.