A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
- URL: http://arxiv.org/abs/2601.10527v2
- Date: Fri, 16 Jan 2026 15:04:58 GMT
- Title: A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
- Authors: Xingjun Ma, Yixu Wang, Hengyuan Xu, Yutao Wu, Yifan Ding, Yunhan Zhao, Zilong Wang, Jiabin Hua, Ming Wen, Jianan Liu, Ranjie Duan, Yifeng Gao, Yingshui Tan, Yunhao Chen, Hui Xue, Xin Wang, Wei Cheng, Jingjing Chen, Zuxuan Wu, Bo Li, Yu-Gang Jiang
- Abstract summary: Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have driven major gains in reasoning, perception, and generation across language and vision. We present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation.
- Score: 101.4233736714284
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly due to fragmented evaluations that focus on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models--GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5--assessing each across language, vision-language, and image generation using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results under standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional--shaped by modality, language, and evaluation design--underscoring the need for standardized, holistic safety assessments to better reflect real-world risk and guide responsible deployment.
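The abstract does not reproduce the report's exact aggregation into leaderboards; the following hedged Python sketch shows one plausible reduction of per-dimension safety rates into a model profile with a mean and a worst-case score (model names, dimensions, and numbers are illustrative assumptions, not the report's data):

```python
# Illustrative sketch of the aggregation step described in the abstract:
# per-dimension safety rates are reduced to a profile with a mean and a
# worst-case score. Names and numbers are assumptions, not reported data.

from statistics import mean

profiles = {
    "model_A": {"benchmark": 0.97, "adversarial": 0.41,
                "multilingual": 0.93, "compliance": 0.95},
    "model_B": {"benchmark": 0.95, "adversarial": 0.05,
                "multilingual": 0.88, "compliance": 0.91},
}

def summarize(rates):
    """Reduce a per-dimension safety profile to mean and worst case."""
    return {"mean_safety": mean(rates.values()),
            "worst_case": min(rates.values())}

# Rank by worst-case safety, mirroring the report's emphasis on
# adversarial worst cases (e.g. safety rates below 6% for some models).
for name, rates in sorted(profiles.items(),
                          key=lambda kv: min(kv[1].values()),
                          reverse=True):
    s = summarize(rates)
    print(f"{name}: mean={s['mean_safety']:.2f}, worst={s['worst_case']:.2f}")
```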
Related papers
- ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories. We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z)
- OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models [54.80460603255789]
We introduce OutSafe-Bench, the first comprehensive content safety evaluation suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips, and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories.
arXiv Detail & Related papers (2025-11-13T13:18:27Z)
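The exact MCRS formula is not given in this summary. One hedged reading of a "cross risk" score is a quadratic form that couples per-category risks through an assumed correlation matrix, so overlapping risks reinforce each other; everything in this sketch (categories, matrix, normalization) is an assumption:

```python
# Hedged sketch of a cross-risk aggregation in the spirit of MCRS.
# The real MCRS definition may differ; all values here are illustrative.

import numpy as np

CATEGORIES = ["violence", "sexual", "misinformation"]  # 3 of 9 categories

# Assumed inter-category correlation (symmetric, ones on the diagonal).
R = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

def cross_risk_score(risk: np.ndarray) -> float:
    """Quadratic-form coupling r^T R r, normalized so all-ones risk = 1."""
    r = np.clip(risk, 0.0, 1.0)
    ones = np.ones_like(r)
    return float(r @ R @ r) / float(ones @ R @ ones)

# One sample's per-category risk scores.
print(cross_risk_score(np.array([0.8, 0.1, 0.4])))  # ~0.24
```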
- Prompt Injection as an Emerging Threat: Evaluating the Resilience of Large Language Models [0.0]
This study proposes a unified framework for evaluating how resistant Large Language Models (LLMs) are to prompt injection attacks. The framework defines three metrics, the Resilience Degradation Index (RDI), the Safety Compliance Coefficient (SCC), and the Instructional Integrity Metric (IIM), to jointly measure robustness, safety, and semantic stability.
arXiv Detail & Related papers (2025-11-03T14:43:56Z)
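The summary does not define the three metrics precisely. Under the plain reading that RDI measures the performance drop when injections are present, SCC the fraction of safety-compliant responses under attack, and IIM the similarity between intended and observed behavior, a minimal sketch might look like this (all formulas are assumptions, not the paper's definitions):

```python
# Minimal sketch of the three metrics named above, under assumed
# definitions:
#   RDI: relative task-performance drop when injections are present
#   SCC: fraction of responses judged safety-compliant under attack
#   IIM: mean semantic similarity between intended and observed outputs

def rdi(clean_acc: float, attacked_acc: float) -> float:
    return (clean_acc - attacked_acc) / clean_acc if clean_acc else 0.0

def scc(compliant_flags: list[bool]) -> float:
    return sum(compliant_flags) / len(compliant_flags)

def iim(similarities: list[float]) -> float:
    # similarities: e.g. embedding cosine similarities, each in [0, 1]
    return sum(similarities) / len(similarities)

print(rdi(0.90, 0.55))           # 0.389: 38.9% degradation under injection
print(scc([True, True, False]))  # 0.667
print(iim([0.9, 0.7, 0.8]))      # 0.8
```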
- Qwen3Guard Technical Report [127.69960525219051]
We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants. Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments. Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
arXiv Detail & Related papers (2025-10-16T04:00:18Z)
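As a hedged illustration of token-level streaming moderation of the kind Stream Qwen3Guard describes, a monitor could score each incoming token and cut generation once risk crosses a threshold. The classifier and threshold below are stand-ins, not the released model's API:

```python
# Hedged sketch of streaming, token-level safety moderation in the
# spirit of Stream Qwen3Guard. `risk_of` stands in for a per-token
# classification head; it is NOT the released model's API.

from typing import Callable, Iterable, Iterator

def moderate_stream(tokens: Iterable[str],
                    risk_of: Callable[[str], float],
                    threshold: float = 0.8) -> Iterator[str]:
    """Yield tokens until the per-token risk estimate crosses threshold."""
    for tok in tokens:
        if risk_of(tok) >= threshold:
            yield "[generation stopped by guard]"
            return
        yield tok

# Toy usage with a keyword-based stand-in classifier.
demo_risk = lambda tok: 1.0 if tok == "<unsafe>" else 0.0
print(list(moderate_stream(["Hello", "world", "<unsafe>", "..."], demo_risk)))
```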
- DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution. No existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
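Measuring over-refusal alongside robustness reduces to two complementary rates: refusals on benign inputs and compliance on harmful ones. A minimal sketch (record fields are assumptions, not DUAL-Bench's schema):

```python
# Minimal sketch of the two complementary rates an over-refusal
# benchmark needs: refusals on benign inputs (over-refusal) and
# compliance on harmful inputs (robustness failures).

def rates(records: list[dict]) -> dict[str, float]:
    benign  = [r for r in records if not r["harmful"]]
    harmful = [r for r in records if r["harmful"]]
    return {
        "over_refusal_rate": sum(r["refused"] for r in benign) / len(benign),
        "unsafe_compliance_rate":
            sum(not r["refused"] for r in harmful) / len(harmful),
    }

print(rates([
    {"harmful": False, "refused": True},   # benign refused: over-refusal
    {"harmful": False, "refused": False},
    {"harmful": True,  "refused": False},  # harmful answered: unsafe
    {"harmful": True,  "refused": True},
]))
```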
- SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z)
- PL-Guard: Benchmarking Language Model Safety for Polish [43.39208658482427]
We introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples designed to challenge model robustness. We train models using different combinations of the annotated data and evaluate their performance, comparing it against publicly available guard models.
arXiv Detail & Related papers (2025-06-19T13:56:41Z)
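The paper's exact perturbation recipe is not spelled out in this summary. Character-level edits such as diacritic stripping (relevant for Polish) and adjacent-character swaps are common, illustrative choices for this kind of robustness testing; the sketch below is a hedged example, not PL-Guard's method:

```python
# Hedged sketch of character-level adversarial perturbations of the
# kind used to stress-test safety classifiers. Diacritic stripping and
# neighbor swaps are illustrative choices, not PL-Guard's recipe.

import random
import unicodedata

def strip_diacritics(text: str) -> str:
    """'groźba' -> 'grozba': drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def swap_neighbors(text: str, rng: random.Random) -> str:
    """Transpose one random adjacent character pair."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

rng = random.Random(0)
sample = "To jest groźba"
print(strip_diacritics(sample))     # "To jest grozba"
print(swap_neighbors(sample, rng))  # same string with one adjacent swap
```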
- REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM [0.098314893665023]
We introduce the REVEAL Framework, a scalable and automated pipeline for evaluating image-input harms in Vision Large Language Models (VLLMs). We extensively evaluated five state-of-the-art VLLMs, GPT-4o, Llama-3.2, Qwen2-VL, Phi3.5V, and Pixtral, across three important harm categories: sexual harm, violence, and misinformation. GPT-4o demonstrated the most balanced performance as measured by our Safety-Usability Index (SUI), followed closely by Pixtral.
arXiv Detail & Related papers (2025-05-07T10:09:55Z)
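The SUI's exact definition is not reproduced here. A natural hedged reading is a harmonic-mean-style combination of a safety rate and a usability (helpful, non-refusing) rate, so a model cannot score well by maximizing only one side; the formula below is an assumption:

```python
# Hedged sketch of a Safety-Usability Index as a harmonic mean of a
# safety rate and a usability rate. REVEAL's actual SUI formula is an
# assumption here, not quoted from the paper.

def safety_usability_index(safety: float, usability: float) -> float:
    if safety + usability == 0.0:
        return 0.0
    return 2 * safety * usability / (safety + usability)

print(safety_usability_index(0.95, 0.90))  # balanced model scores high
print(safety_usability_index(1.00, 0.10))  # refusing everything scores low
```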
- Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings [51.65890794988425]
This study presents the first comprehensive safety evaluation of the DeepSeek models. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models.
arXiv Detail & Related papers (2025-03-19T10:44:37Z)
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Our experiments demonstrate that fine-tuning InternVL2.5-8B with the proposed MIS dataset significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks (a record-format sketch follows this entry).
arXiv Detail & Related papers (2025-01-30T17:59:45Z)
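A record in a dataset of this shape, multi-image inputs plus a safety chain-of-thought label, might look like the following sketch; field names are assumptions, not the MIS schema:

```python
# Hedged sketch of one training record combining multi-image inputs
# with a safety chain-of-thought label, as the MIS-style dataset above
# describes. Field names are assumptions, not the published schema.

from dataclasses import dataclass

@dataclass
class MultiImageSafetyRecord:
    images: list[str]       # paths or URLs of the input images
    instruction: str        # user request referring to the images
    safety_cot: list[str]   # step-by-step safety reasoning
    verdict: str            # e.g. "safe" or "unsafe"

record = MultiImageSafetyRecord(
    images=["img_001.jpg", "img_002.jpg"],
    instruction="Combine the steps shown across both images.",
    safety_cot=[
        "Image 1 alone is benign; image 2 alone is benign.",
        "Combined, the steps describe a restricted procedure.",
    ],
    verdict="unsafe",
)
print(record.verdict)
```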