Related papers: OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

URL: http://arxiv.org/abs/2601.01592v2
Date: Sat, 10 Jan 2026 16:23:49 GMT
Title: OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs
Authors: Xin Wang, Yunhao Chen, Juncheng Li, Yixu Wang, Yang Yao, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang, Xia Hu,
Abstract summary: We introduce OpenRT, a unified, modular, and high- throughput red-teaming framework for MLLM safety evaluation.<n>At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five dimensions.<n>Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies.
Score: 36.57820295876294
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid integration of Multimodal Large Language Models (MLLMs) into critical applications is increasingly hindered by persistent safety vulnerabilities. However, existing red-teaming benchmarks are often fragmented, limited to single-turn text interactions, and lack the scalability required for systematic evaluation. To address this, we introduce OpenRT, a unified, modular, and high-throughput red-teaming framework designed for comprehensive MLLM safety evaluation. At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five critical dimensions: model integration, dataset management, attack strategies, judging methods, and evaluation metrics. By standardizing attack interfaces, it decouples adversarial logic from a high-throughput asynchronous runtime, enabling systematic scaling across diverse models. Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies. Through an extensive empirical study on 20 advanced models (including GPT-5.2, Claude 4.5, and Gemini 3 Pro), we expose critical safety gaps: even frontier models fail to generalize across attack paradigms, with leading models exhibiting average Attack Success Rates as high as 49.14%. Notably, our findings reveal that reasoning models do not inherently possess superior robustness against complex, multi-turn jailbreaks. By open-sourcing OpenRT, we provide a sustainable, extensible, and continuously maintained infrastructure that accelerates the development and standardization of AI safety.

Related papers

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models [17.848889547838173]
MUSE (Multimodal Unified Safety Evaluation) is an open-source, run-centric platform that integrates automatic cross-modal payload generation.<n>A dual-metric framework distinguishes hard Attack Success Rate (Compliance only) from soft ASR (including Partial Compliance)<n>Experiments show that multi-turn strategies can achieve up to 90-100% ASR against models with near-perfect single-turn refusal.
arXiv Detail & Related papers (2026-03-03T00:10:23Z)
Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models [54.61181161508336]
We introduce Multi-Faceted Attack (MFA), a framework that exposes general safety vulnerabilities in leading defense-equipped Vision-Language Models (VLMs)<n>The core component of MFA is the Attention-Transfer Attack (ATA), which hides harmful instructions inside a meta task with competing objectives.<n>MFA achieves a 58.5% success rate and consistently outperforms existing methods.
arXiv Detail & Related papers (2025-11-20T07:12:54Z)
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations [0.0]
Multimodal large language models (MLLMs) have achieved remarkable progress, yet remain critically vulnerable to adversarial attacks.<n>We present a systematic study of multimodal jailbreaks targeting both vision-language and audio-language models.<n>Our evaluation spans 1,900 adversarial prompts across three high-risk safety categories.
arXiv Detail & Related papers (2025-10-23T05:16:33Z)
ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks [30.39725685183195]
We propose ARMs, an adaptive red-teaming agent that conducts comprehensive risk assessments for vision-language models (VLMs)<n>Given a target harmful behavior or risk definition, ARMs automatically optimize diverse red-teaming strategies with reasoning-enhanced multi-step orchestration.<n>We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities inVLMs.
arXiv Detail & Related papers (2025-10-03T02:28:02Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.<n>We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats.<n>Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions.<n>AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.<n> Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z)
MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks [85.3303135160762]
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models.<n>It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines.<n>We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey [50.031628043029244]
Multimodal generative models are susceptible to jailbreak attacks, which can bypass built-in safety mechanisms and induce the production of potentially harmful content.<n>We present a detailed taxonomy of attack methods, defense mechanisms, and evaluation frameworks specific to multimodal generative models.
arXiv Detail & Related papers (2024-11-14T07:51:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.