COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
- URL: http://arxiv.org/abs/2510.04196v1
- Date: Sun, 05 Oct 2025 13:30:03 GMT
- Title: COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability
- Authors: Yizhuo Ding, Mingkang Chen, Qiuhua Liu, Fenghua Weng, Wanying Qu, Yue Yang, Yugang Jiang, Zuxuan Wu, Yanwei Fu, Wenqi Shao
- Abstract summary: We present COSMO-RL, a mixed reinforcement learning framework that trains LMRMs under multimodal, multitask, and multiobjective signals. Our approach aims to let safety and capability grow together in one stable pipeline rather than competing during alignment.
- Score: 101.80200069234377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Multimodal Reasoning Models (LMRMs) are moving into real applications, where they must be both useful and safe. Safety is especially challenging in multimodal settings: images and text can be combined to bypass guardrails, and single-objective training can cause policy drift that yields over-refusal on benign inputs or unsafe compliance on risky ones. We present COSMO-RL, a mixed reinforcement learning framework that trains reasoning-oriented LMRMs under multimodal, multitask, and multiobjective signals, and we release the resulting model, COSMO-R1. Our approach aims to let safety and capability grow together in one stable pipeline rather than competing during alignment. In experiments, COSMO-R1 improves safety while maintaining, and often improving, multimodal reasoning and instruction following, shows stronger robustness to multimodal jailbreaks, and reduces unnecessary refusals. The framework also transfers across backbones with consistent gains. Ablations support the design choices, indicating a simple path to advancing safety and general capability together in LMRMs.
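The abstract describes training under multiobjective signals so that safety and capability are optimized jointly rather than in competing stages. The paper's exact reward formulation is not given here; the following is only a minimal sketch of scalarizing a capability reward and a safety reward into one training signal, with all names, weights, and the linear combination rule being illustrative assumptions rather than COSMO-RL's actual design.

```python
# Hypothetical sketch of a mixed multi-objective reward signal.
# Assumes per-sample capability and safety scores in [0, 1];
# the weights and linear combination are illustrative only.
from dataclasses import dataclass


@dataclass
class Rewards:
    capability: float  # e.g. task accuracy or reasoning score
    safety: float      # e.g. 1.0 if the response is safe, else 0.0


def mixed_reward(r: Rewards, w_cap: float = 0.5, w_safe: float = 0.5) -> float:
    """Scalarize the objectives so safety and capability are
    rewarded in the same update rather than in separate stages."""
    return w_cap * r.capability + w_safe * r.safety


print(mixed_reward(Rewards(capability=0.8, safety=1.0)))  # 0.9
```

A single scalarized reward is the simplest way such signals can coexist in one RL pipeline; in practice a multiobjective scheme may also use per-task weighting or constrained optimization.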
Related papers
- CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs [10.42126976065225]
Multimodal large language models (MLLMs) enable interaction over both text and images. This paper introduces CSR-Bench, a benchmark for evaluating cross-modal reliability. We evaluate 16 state-of-the-art MLLMs and observe systematic cross-modal alignment gaps.
arXiv Detail & Related papers (2026-02-03T08:49:44Z) - SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization [79.14563283347773]
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities. Cross-modal couplings can produce unsafe semantics even when individual inputs are benign. We propose SafeGRPO, a self-rewarded multimodal safety alignment framework.
arXiv Detail & Related papers (2025-11-17T05:09:49Z) - SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z) - When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models [50.66979825532277]
We introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks.
arXiv Detail & Related papers (2025-09-15T15:40:58Z) - Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training [1.5349686675266894]
Current methods for content safety in Large Language Models (LLMs) rely on multi-stage training pipelines. We propose a unified co-training framework that efficiently integrates multiple safety behaviors. We show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance.
arXiv Detail & Related papers (2025-08-12T02:39:33Z) - Automating Steering for Safe Multimodal Large Language Models [58.36932318051907]
We introduce AutoSteer, a modular and adaptive inference-time intervention technique that requires no fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model's internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected.
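The three components above imply a simple inference-time control flow: probe an intermediate representation for risk, and divert to a refusal path when the estimate crosses a threshold. The sketch below is only an illustration of that prober-then-refuse pattern; the function names, the toy prober, and the threshold are assumptions, not AutoSteer's implementation.

```python
# Hypothetical sketch of an inference-time safety intervention loop
# in the spirit of a prober + refusal-head design. Names, threshold,
# and the toy "prober" are illustrative, not from the paper.
from typing import Callable, List


def steered_generate(
    hidden_state: List[float],
    prober: Callable[[List[float]], float],  # estimates P(toxic output)
    generate: Callable[[], str],             # base model's generation step
    threshold: float = 0.5,
) -> str:
    """Run the base model unless the prober flags the intermediate
    representation as risky, in which case a refusal is emitted."""
    risk = prober(hidden_state)
    if risk >= threshold:
        return "I can't help with that request."  # lightweight refusal path
    return generate()


# Toy usage: a stand-in "prober" that averages the hidden state as a risk proxy.
toy_prober = lambda h: sum(h) / len(h)
print(steered_generate([0.9, 0.8, 0.7], toy_prober, lambda: "answer"))
```

Because the intervention wraps generation rather than changing model weights, this kind of steering can be bolted onto a frozen model, which is the appeal of inference-time approaches.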
arXiv Detail & Related papers (2025-07-17T16:04:55Z) - MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models [17.824240702928133]
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning tasks through enhanced chain-of-thought capabilities. Existing safety alignment approaches fall short in addressing the complex and nuanced threats posed by multimodal inputs. MSR-Align supports fine-grained, deliberative reasoning over standardized safety policies across both vision and text modalities.
arXiv Detail & Related papers (2025-06-24T02:37:59Z) - DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models [37.104276926258095]
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data. We introduce DREAM (Disentangling Risks to Enhance Safety Alignment in MLLMs), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback.
arXiv Detail & Related papers (2025-04-25T03:54:24Z) - Think Smart, Act SMARL! Analyzing Probabilistic Logic Shields for Multi-Agent Reinforcement Learning [3.7957452405531265]
Shielded Multi-Agent Reinforcement Learning (SMARL) is a general framework for steering MARL towards norm-compliant outcomes. Our key contributions are: (1) a novel Probabilistic Logic Temporal Difference (PLTD) update for shielded, independent Q-learning; (2) a probabilistic logic policy gradient method for shielded PPO with formal safety guarantees for MARL; and (3) comprehensive evaluation across symmetric and asymmetrically shielded $n$-player game-theoretic benchmarks.
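Probabilistic logic shields of this kind are commonly described as reweighting a policy's action distribution by each action's probability of satisfying the safety specification. As a rough illustration only (the function below is a generic sketch of that reweighting idea, not the PLTD or shielded-PPO method from the paper):

```python
# Minimal sketch of probabilistic shielding: rescale a policy's action
# probabilities by each action's probability of being safe, then
# renormalize. Values are illustrative; this is not the paper's code.
from typing import List


def shield(policy_probs: List[float], safe_probs: List[float]) -> List[float]:
    """Return the shielded distribution proportional to pi(a) * P(safe | a)."""
    weighted = [p * s for p, s in zip(policy_probs, safe_probs)]
    total = sum(weighted)
    if total == 0:  # no action deemed safe: fall back to the base policy
        return list(policy_probs)
    return [w / total for w in weighted]


print(shield([0.5, 0.3, 0.2], [0.0, 1.0, 1.0]))  # unsafe first action zeroed out
```

The shielded distribution keeps the base policy's relative preferences among actions that the logic layer considers safe, which is what allows such methods to pair safety constraints with standard RL updates.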
arXiv Detail & Related papers (2024-11-07T16:59:32Z) - Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. To empirically investigate this problem, we developed SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.