Related papers: MLLM-as-a-Judge for Image Safety without Human Labeling

MLLM-as-a-Judge for Image Safety without Human Labeling

URL: http://arxiv.org/abs/2501.00192v1
Date: Tue, 31 Dec 2024 00:06:04 GMT
Title: MLLM-as-a-Judge for Image Safety without Human Labeling
Authors: Zhenting Wang, Shuming Hu, Shiyu Zhao, Xiaowen Lin, Felix Juefei-Xu, Zhuowei Li, Ligong Han, Harihar Subramanyam, Li Chen, Jianfa Chen, Nan Jiang, Lingjuan Lyu, Shiqing Ma, Dimitris N. Metaxas, Ankit Jain,
Abstract summary: In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content.<n>It is crucial to identify such unsafe images based on established safety rules.<n>Existing approaches typically fine-tune MLLMs with human-labeled datasets.
Score: 81.24707039432292
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. Thus, it becomes crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which however brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to frequently update safety rules, making fine-tuning on human-based annotation more challenging. This raises the research question: Can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research showed that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose a MLLM-based method includes objectifying safety rules, assessing the relevance between rules and images, making quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and conducting more in-depth reasoning with cascaded chain-of-thought processes if necessary. Experiment results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.

Related papers

Beyond Visual Safety: Jailbreaking Multimodal Large Language Models for Harmful Image Generation via Semantic-Agnostic Inputs [2.903006172774433]
Beyond Visual Safety (BVS) is a novel image-text pair jailbreaking framework designed to probe the visual safety boundaries of MLLMs.<n>BVS employs a "reconstruction-then-generation" strategy, leveraging visual splicing and inductive recomposition to decouple malicious intent from raw inputs.<n>Our findings expose critical vulnerabilities in the visual safety alignment of current MLLMs.
arXiv Detail & Related papers (2026-01-22T06:56:27Z)
The Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning [46.156246746700894]
We introduce MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety.<n>We find that models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench.<n> unsafe generations exhibit lower attention entropy than safe ones on average.
arXiv Detail & Related papers (2026-01-20T16:24:18Z)
SafeGuider: Robust and Practical Content Safety Control for Text-to-Image Models [74.11062256255387]
Text-to-image models are highly vulnerable to adversarial prompts, which can bypass safety measures and produce harmful content.<n>We introduce SafeGuider, a two-step framework designed for robust safety control without compromising generation quality.<n>SafeGuider demonstrates exceptional effectiveness in minimizing attack success rates, achieving a maximum rate of only 5.48% across various attack scenarios.
arXiv Detail & Related papers (2025-10-05T10:24:48Z)
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
Can't See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs [56.440345471966666]
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. This paper introduces MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios. MMSafeAware includes both unsafe and over-safety subsets to assess models abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness.
arXiv Detail & Related papers (2025-02-16T16:12:40Z)
Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [49.60774626839712]
Training multimodal generative models can expose users to harmful, unsafe and controversial or culturally-inappropriate outputs. We propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process to generate safer images. We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs. Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol. Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.36429361299807]
multimodal large language models (MLLMs) have demonstrated remarkable success in engaging in conversations involving visual inputs. The integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs. We introduce a technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution.
arXiv Detail & Related papers (2024-09-17T17:14:41Z)
EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts [32.590822043053734]
Non-toxic text still carries a risk of generating non-compliant images, which is referred to as implicit unsafe prompts. We propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt. Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity.
arXiv Detail & Related papers (2024-08-02T05:17:14Z)
Is it safe to cross? Interpretable Risk Assessment with GPT-4V for Safety-Aware Street Crossing [8.468153670795443]
This paper introduces an innovative approach that leverages large multimodal models (LMMs) to interpret complex street crossing scenes. By generating a safety score and scene description in natural language, our method supports safe decision-making for the blind and low-vision individuals.
arXiv Detail & Related papers (2024-02-09T21:37:13Z)
Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models. It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content. Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.