GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
- URL: http://arxiv.org/abs/2505.11049v1
- Date: Fri, 16 May 2025 09:46:10 GMT
- Title: GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
- Authors: Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, Bryan Hooi
- Abstract summary: This paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. We construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost.
- Score: 43.89818154399979
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. We further enhance its reasoning about moderation through online RL. Concretely, to increase the diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. In addition, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release the data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
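The sketch below is a minimal illustration of how the length-aware safety reward (accuracy, format, token cost) and the dynamic clipping schedule mentioned in the abstract might be wired together. The weights, the `<think>`/`Answer:` output tags, and the linear annealing schedule are assumptions for illustration only, not the released GuardReasoner-VL implementation.

```python
# Hypothetical sketch of a length-aware safety reward and a dynamic clipping
# schedule, in the spirit of the abstract above. All coefficients, tag names,
# and the annealing shape are illustrative assumptions.

def length_aware_safety_reward(
    pred_label: str,
    gold_label: str,
    output: str,
    num_tokens: int,
    max_tokens: int = 1024,
    lambda_len: float = 0.1,
) -> float:
    """Combine moderation correctness, format compliance, and token cost."""
    # Accuracy term: 1 if the moderation decision matches the gold label.
    accuracy = 1.0 if pred_label == gold_label else 0.0

    # Format term: small bonus when the output wraps reasoning and the final
    # decision in the expected markers (marker names are assumptions).
    has_format = "<think>" in output and "</think>" in output and "Answer:" in output
    format_bonus = 0.5 if has_format else 0.0

    # Token-cost term: penalize unnecessarily long reasoning traces,
    # scaled by the fraction of the token budget consumed.
    length_penalty = lambda_len * min(num_tokens / max_tokens, 1.0)

    return accuracy + format_bonus - length_penalty


def dynamic_clip_epsilon(
    step: int, total_steps: int, eps_start: float = 0.3, eps_end: float = 0.2
) -> float:
    """Anneal the policy-gradient clipping parameter from a looser value
    (exploration in early stages) to a tighter one (exploitation later);
    the linear schedule here is an assumption."""
    frac = min(step / max(total_steps, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

In this sketch a correct moderation decision dominates the reward, a well-formatted trace earns a small bonus, and the token-cost term discourages padded reasoning; the clipping parameter starts loose and tightens over training, mirroring the exploration-then-exploitation behavior the abstract describes.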
Related papers
- Reasoning as an Adaptive Defense for Safety [31.00328416755368]
We build a recipe called TARS (Training Adaptive Reasoners for Safety). We train models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. Our work provides an effective, open recipe for training LLMs against jailbreaks and harmful requests by reasoning per prompt.
arXiv Detail & Related papers (2025-07-01T17:20:04Z) - SAFER: Probing Safety in Reward Models with Sparse Autoencoder [15.804171763844323]
We present Sparse Autoencoder For Enhanced Reward model (SAFER). We uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification.
arXiv Detail & Related papers (2025-07-01T11:04:03Z) - Saffron-1: Safety Inference Scaling [69.61130284742353]
SAFFRON is a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. We publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M).
arXiv Detail & Related papers (2025-06-06T18:05:45Z) - Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models [34.66687625996389]
Multimodal large language models (MLLMs) are critical for developing general-purpose AI assistants, yet they face growing safety risks. How can we ensure that MLLMs are safely aligned to prevent undesired behaviors such as discrimination, misinformation, or violations of ethical standards? We propose Safe RLHF-V, the first multimodal safety alignment framework that jointly optimizes helpfulness and safety.
arXiv Detail & Related papers (2025-03-22T07:40:20Z) - From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models [58.16075709485292]
CAREVL is a novel method for preference reward modeling by reliably using both high- and low-confidence data. CAREVL achieves performance improvements over traditional distillation-based methods on the VL-RewardBench and MLLM-as-a-Judge benchmarks.
arXiv Detail & Related papers (2025-03-08T16:13:18Z) - SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities [21.317245896641136]
Long chain-of-thought (CoT) reasoning generates structured intermediate steps, enhancing reasoning capabilities. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT-style outputs of large reasoning models (LRMs).
arXiv Detail & Related papers (2025-02-17T16:57:56Z) - OverThink: Slowdown Attacks on Reasoning LLMs [41.733352553317204]
The OVERTHINK attack can amplify the costs for third-party applications operating reasoning models. Our results show up to an 18x slowdown on the FreshQA dataset and a 46x slowdown on the SQuAD dataset.
arXiv Detail & Related papers (2025-02-04T18:12:41Z) - GuardReasoner: Towards Reasoning-based LLM Safeguards [63.53800124080227]
This paper proposes GuardReasoner, a new safeguard for LLMs. We first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In this manner, GuardReasoner achieves better performance, explainability, and generalizability.
arXiv Detail & Related papers (2025-01-30T17:06:06Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance with harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)