GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision
- URL: http://arxiv.org/abs/2511.20994v1
- Date: Wed, 26 Nov 2025 02:49:51 GMT
- Title: GuardTrace-VL: Detecting Unsafe Multimodal Reasoning via Iterative Safety Supervision
- Authors: Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, Nenghai Yu
- Abstract summary: GuardTrace-VL is a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis. We propose a three-stage progressive training scheme combined with a data refinement process, enabling the model to learn nuanced and context-dependent safety preferences. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks.
- Score: 47.99880677909197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via an MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, the GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score over the strongest prior multimodal safety defense methods. The code will be made publicly available.
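To make the QTA auditing setup concrete, here is a minimal sketch (not the released GuardTrace-VL code) of how a vision-aware auditor could be wrapped around a Question-Thinking-Answer trace. The `QTATrace` container and the `score_unsafe` callable are illustrative assumptions standing in for the trained auditor; the point is only that the thinking stage is scored jointly with the image and independently of the answer, so unsafe reasoning is flagged even when the final response looks benign.

```python
# Minimal sketch of QTA-style auditing, assuming a hypothetical `score_unsafe`
# scorer that returns a joint image-text unsafety probability for one stage.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class QTATrace:
    image_path: str   # visual context shared by all stages
    question: str     # user query
    thinking: str     # intermediate reasoning trace
    answer: str       # final response

def audit_trace(trace: QTATrace,
                score_unsafe: Callable[[str, str], float],
                threshold: float = 0.5) -> Dict[str, bool]:
    """Flag each stage whose joint image-text unsafety score exceeds the threshold."""
    stages = {
        "question": trace.question,
        "thinking": trace.thinking,   # audited even when the answer is harmless
        "answer": trace.answer,
    }
    return {name: score_unsafe(trace.image_path, text) > threshold
            for name, text in stages.items()}

# Usage with a dummy scorer; a real deployment would plug in the trained auditor.
flags = audit_trace(
    QTATrace("street_scene.jpg", "Who is this person?",
             "The badge number suggests ...", "I can't identify individuals."),
    score_unsafe=lambda image, text: 0.9 if "badge number" in text else 0.1,
)
print(flags)  # e.g. {'question': False, 'thinking': True, 'answer': False}
```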
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z)
- Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models [58.17589701432514]
Think-Reflect-Revise (TRR) is a training framework designed to enhance the safety alignment of Large Vision Language Models (LVLMs). We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning.
arXiv Detail & Related papers (2025-12-08T03:46:03Z)
- SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z)
- InvThink: Towards AI Safety via Inverse Reasoning [23.940337534762563]
InvThink gives large language models the capability of inverse thinking: reasoning through failure modes before generating responses. Our method reveals three key findings: (i) safety improvements show stronger scaling with model size compared to existing safety methods. InvThink excels in high-stakes domains including external-facing (medicine, finance, law) and agentic (blackmail, murder) risk scenarios, achieving up to 15.7% reduction in harmful responses.
arXiv Detail & Related papers (2025-10-02T01:26:53Z)
- Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention [53.25106308403173]
We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible to and exploited by malicious users. We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals.
arXiv Detail & Related papers (2025-09-29T07:41:09Z)
- LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation [4.29885665563186]
LatentGuard is a framework that combines behavioral alignment with supervised latent-space control for interpretable and precise safety steering. Our results show significant improvements in both safety controllability and response interpretability without compromising utility.
arXiv Detail & Related papers (2025-09-24T07:31:54Z)
- ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models [20.274878511727945]
ReasoningShield is a framework for moderating Chain-of-Thoughts (CoTs) in Large Reasoning Models (LRMs). ReasoningShield achieves state-of-the-art performance, outperforming task-specific tools like LlamaGuard-4 by 35.6% and general-purpose commercial models like GPT-4o by 15.8% on benchmarks.
arXiv Detail & Related papers (2025-05-22T19:44:41Z)
- Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs. By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process. Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches.
arXiv Detail & Related papers (2025-05-08T13:29:46Z)
- Safe Vision-Language Models via Unsafe Weights Manipulation [75.04426753720551]
We revise safety evaluation by introducing Safe-Ground, a new set of metrics that evaluate safety at different levels of granularity. We take a different direction and explore whether it is possible to make a model safer without training, introducing Unsafe Weights Manipulation (UWM). UWM uses a calibration set of safe and unsafe instances to compare activations between safe and unsafe content, identifying the most important parameters for processing the latter.
arXiv Detail & Related papers (2025-03-14T17:00:22Z)
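Since UWM (the entry above) is a training-free edit, its core idea is easy to illustrate. Below is a minimal, self-contained sketch on a toy layer: compare mean activations on safe versus unsafe calibration batches, pick the units that respond most strongly to unsafe content, and dampen their incoming weights. The toy layer, the random calibration data, the scoring rule, and the dampening factor are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of a UWM-style, training-free safety edit on a toy linear layer.
import torch

torch.manual_seed(0)
hidden = torch.nn.Linear(16, 32)           # toy layer standing in for a VLM block

safe_batch = torch.randn(64, 16)           # calibration set: safe instances
unsafe_batch = torch.randn(64, 16) + 0.5   # calibration set: unsafe instances

with torch.no_grad():
    safe_act = torch.relu(hidden(safe_batch)).mean(dim=0)    # per-unit mean activation
    unsafe_act = torch.relu(hidden(unsafe_batch)).mean(dim=0)

    # Units that respond much more to unsafe content are treated as "unsafe" units.
    gap = unsafe_act - safe_act
    top_units = torch.topk(gap, k=4).indices

    # Training-free edit: scale down the incoming weights (and biases) of those units.
    hidden.weight[top_units] *= 0.1
    hidden.bias[top_units] *= 0.1

print("dampened units:", top_units.tolist())
```

In an actual vision-language model the calibration batches would be image-text pairs and the comparison would run over the model's real layers; the snippet only shows the compare-rank-dampen pattern the abstract describes.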