Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
- URL: http://arxiv.org/abs/2504.16656v2
- Date: Fri, 25 Apr 2025 15:28:34 GMT
- Title: Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
- Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
- Abstract summary: We present Skywork R1V2, a next-generation multimodal reasoning model. At its core, R1V2 introduces a hybrid reinforcement learning paradigm.
- Score: 17.233735911531117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Skywork R1V2, a next-generation multimodal reasoning model and a major leap forward from its predecessor, Skywork R1V. At its core, R1V2 introduces a hybrid reinforcement learning paradigm that jointly leverages Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO), harmonizing reward-model guidance with rule-based strategies and thereby addressing the long-standing challenge of balancing sophisticated reasoning capabilities with broad generalization. To further enhance training efficiency, we introduce the Selective Sample Buffer (SSB) mechanism, which effectively counters the "Vanishing Advantages" dilemma inherent in GRPO by prioritizing high-value samples throughout the optimization process. Notably, we observe that excessive reinforcement signals can induce visual hallucinations, a phenomenon we systematically monitor and mitigate through calibrated reward thresholds throughout training. Empirical results affirm the exceptional capability of R1V2, with benchmark-leading performances such as 62.6 on OlympiadBench, 78.9 on AIME2024, 63.6 on LiveCodeBench, and 73.6 on MMMU. These results underscore R1V2's superiority over existing open-source models and demonstrate significant progress in closing the performance gap with premier proprietary systems, including Gemini 2.5 and OpenAI-o4-mini. The Skywork R1V2 model weights have been publicly released to promote openness and reproducibility: https://huggingface.co/Skywork/Skywork-R1V2-38B.
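The abstract's two central mechanisms lend themselves to a short illustration. Below is a minimal sketch, assuming one scalar reward per sampled response: GRPO's group-relative advantage (which vanishes when all responses in a group score alike) and a selective buffer in the spirit of SSB that keeps only groups with informative advantages. All names here are illustrative, not taken from the Skywork codebase.

```python
# Minimal sketch of GRPO's group-relative advantage and a selective
# sample buffer in the spirit of SSB. Names are illustrative.
import numpy as np

def group_relative_advantages(rewards):
    """GRPO advantage: standardize each response's reward within its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std == 0.0:  # all responses scored alike -> advantages vanish
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

class SelectiveSampleBuffer:
    """Keep prompts whose groups still yield nonzero advantages, so
    later batches can be weighted toward informative samples."""
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.items = []  # (priority, prompt, responses, advantages)

    def add(self, prompt, responses, advantages):
        priority = float(np.abs(advantages).mean())
        if priority > 0.0:  # skip groups whose advantages vanished
            self.items.append((priority, prompt, responses, advantages))
            self.items.sort(key=lambda x: x[0], reverse=True)
            del self.items[self.capacity:]

    def sample(self, k):
        return self.items[:min(k, len(self.items))]
```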
Related papers
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization [86.30192066451256]
We propose RL-PLUS, a novel hybrid-policy optimization approach for Large Language Models (LLMs). RL-PLUS synergizes internal exploitation with external data to achieve stronger reasoning capabilities and surpass the boundaries of base models. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach.
arXiv Detail & Related papers (2025-07-31T23:55:29Z) - MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization [74.04867639197445]
MiroMind-M1 is a set of fully open-source RLMs built on the Qwen-2.5 backbone. Our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems.
arXiv Detail & Related papers (2025-07-19T16:21:23Z) - Skywork-R1V3 Technical Report [14.952041273882639]
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM). Its key innovation lies in effectively transferring reasoning skills from text-only large language models to visual tasks. We introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection.
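As a rough illustration of the entropy indicator mentioned above, here is a hedged sketch that scores a checkpoint by the mean entropy of selected token positions; which tokens count as "critical" is an assumption, since the abstract does not specify.

```python
# Hypothetical sketch: score a checkpoint by mean token-level entropy
# over selected positions. The masking of "critical" tokens is an
# assumption, not the paper's exact procedure.
import torch
import torch.nn.functional as F

def mean_token_entropy(logits, mask=None):
    """logits: (seq_len, vocab_size); mask: optional bool (seq_len,)
    selecting the reasoning tokens to score."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    if mask is not None:
        entropy = entropy[mask]
    return entropy.mean().item()

# Checkpoint selection could then compare this statistic across saved
# checkpoints on a fixed probe set.
```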
arXiv Detail & Related papers (2025-07-08T16:47:16Z) - Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning [48.426139299991604]
We present SPRO, a novel framework that enables process-aware RL through two key innovations. SPRO outperforms vanilla GRPO with 3.4x higher training efficiency and a 17.5% test-accuracy improvement. Notably, SPRO incurs no additional computational overhead compared to outcome-supervised RL methods such as GRPO, which benefits industrial deployment.
arXiv Detail & Related papers (2025-07-02T10:05:14Z) - Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy [26.455112415445146]
We introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling.
arXiv Detail & Related papers (2025-07-02T04:40:29Z) - GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z) - DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models [18.06361678575107]
We propose DINO-R1, the first attempt to incentivize the visual in-context reasoning capabilities of vision foundation models. DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy. Experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines.
arXiv Detail & Related papers (2025-05-29T21:58:06Z) - Skywork Open Reasoner 1 Technical Report [51.403686909760914]
We present Skywork-OR1, an effective and scalable reinforcement learning (RL) implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks.
arXiv Detail & Related papers (2025-05-28T12:56:04Z) - Reinforcement Learning for Reasoning in Large Language Models with One Training Example [129.11571295790807]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs).
We identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated.
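For readers unfamiliar with RLVR, the verifiable reward is typically a rule-based check rather than a learned model. A minimal sketch, assuming the common convention that the final answer appears in a \boxed{} span:

```python
# Hedged sketch of a verifiable (rule-based) reward as used in RLVR:
# reward is 1 only when the extracted final answer matches the
# reference. The \boxed{} extraction convention is an assumption.
import re

def verifiable_reward(completion: str, reference: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

assert verifiable_reward(r"... so the answer is \boxed{42}", "42") == 1.0
```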
arXiv Detail & Related papers (2025-04-29T09:24:30Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a REINFORCE-like algorithm perspective and analyze its core components.
We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO.
Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
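The filtering rule Reinforce-Rej describes is simple enough to sketch directly: with binary correctness rewards, a prompt contributes a gradient only when its group of samples is mixed. Names below are illustrative.

```python
# Sketch of the filtering rule the abstract describes: drop prompts
# whose sampled groups are entirely correct or entirely incorrect,
# since they carry no gradient signal under group-relative baselines.
def keep_for_update(group_rewards):
    """group_rewards: list of 0/1 correctness scores for one prompt."""
    return 0 < sum(group_rewards) < len(group_rewards)

batch = {"p1": [1, 1, 1, 1], "p2": [0, 1, 0, 1], "p3": [0, 0, 0, 0]}
filtered = {p: r for p, r in batch.items() if keep_for_update(r)}
assert list(filtered) == ["p2"]  # only the mixed group survives
```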
arXiv Detail & Related papers (2025-04-15T16:15:02Z) - VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [55.97950660659051]
GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection. In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning. Our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to 80.3%, 61.8%, and 43.9%, respectively.
arXiv Detail & Related papers (2025-04-10T17:41:56Z) - Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought [16.183329458166618]
We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. We propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to enhance cross-modal integration efficiency. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista.
arXiv Detail & Related papers (2025-04-08T01:19:20Z) - Improved Visual-Spatial Reasoning via R1-Zero-Like Training [18.61987706753493]
Video visual-spatial intelligence (VSI) has emerged as one of the most pivotal reasoning capabilities of multi-modal large language models (MLLMs). This work conducts a first in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training.
arXiv Detail & Related papers (2025-04-01T15:11:11Z) - OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning [29.053899071144976]
We propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks.
Specifically, we introduce Group Relative Policy Optimization with a dynamic Kullback-Leibler strategy (GRPO-D).
GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation.
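The abstract does not spell out the dynamic KL strategy, so the following is only one plausible reading: a GRPO-style loss whose KL coefficient is annealed over training. The linear decay schedule and hyperparameters are assumptions, not the paper's method.

```python
# Speculative sketch of a GRPO-style loss with a dynamically annealed
# KL coefficient; the decay schedule and defaults are assumptions.
import torch

def grpo_d_loss(logp, logp_ref, advantages, step, total_steps,
                kl_max=0.1, kl_min=0.01):
    # KL weight decays linearly, loosening the tie to the reference
    beta = kl_max - (kl_max - kl_min) * (step / total_steps)
    # surrogate ratio: value 1, but gradients flow through logp
    ratio = torch.exp(logp - logp.detach())
    pg_loss = -(ratio * advantages).mean()
    # the unbiased k3 estimator of KL(policy || reference)
    kl = (torch.exp(logp_ref - logp) - (logp_ref - logp) - 1).mean()
    return pg_loss + beta * kl
```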
arXiv Detail & Related papers (2025-03-20T12:22:18Z) - Explainable AI-aided Feature Selection and Model Reduction for DRL-based V2X Resource Allocation [18.49800990388549]
Artificial intelligence (AI) is expected to significantly enhance radio resource management (RRM) in sixth-generation (6G) networks. The lack of explainability in complex deep learning (DL) models poses a challenge for practical implementation. This paper proposes a novel explainable AI (XAI)-based framework for feature selection and model complexity reduction.
arXiv Detail & Related papers (2025-01-23T10:55:38Z) - RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of quality-unrelated artifacts such as response length and format. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or comparable to PPO and DPO.
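REBEL's name describes its update: regress relative rewards. Under the common paired-sample formulation, a hedged sketch of the squared-error objective might look like the following; eta and the exact parameterization are assumptions here.

```python
# Hedged sketch of a REBEL-style update: regress the scaled difference
# of policy log-ratios for a response pair onto their reward gap.
import torch

def rebel_loss(logp_y, logp_yp, logp_old_y, logp_old_yp,
               reward_y, reward_yp, eta=1.0):
    # predicted relative reward from the policy's log-ratio difference
    pred = (1.0 / eta) * ((logp_y - logp_old_y) - (logp_yp - logp_old_yp))
    target = reward_y - reward_yp  # observed relative reward
    return ((pred - target) ** 2).mean()
```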
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences [21.5605000515622]
This paper studies post-training large language models (LLMs) using preference feedback from an oracle to help a model iteratively improve over itself.
We introduce Direct Nash Optimization (DNO), a provable and efficient algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences.
In our experiments, a resulting 7B-parameter Orca-2.5 model achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (from 7% to 33%) over the initializing model.
arXiv Detail & Related papers (2024-04-04T17:56:41Z) - WARM: On the Benefits of Weight Averaged Reward Models [63.08179139233774]
We propose Weight Averaged Reward Models (WARM) to mitigate reward hacking.
Experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the overall quality and alignment of LLM predictions.
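Weight averaging itself is a one-liner over aligned checkpoints. A minimal sketch, assuming all reward models were fine-tuned from a shared initialization so their state dicts align key-by-key:

```python
# Minimal sketch of averaging reward-model weights, the mechanism the
# WARM title describes. Assumes a shared architecture and init so the
# checkpoints' state dicts align key-by-key.
import torch

def average_reward_models(state_dicts):
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
    return avg

# Usage: load several RM checkpoints fine-tuned from one init, average
# them, then load_state_dict(avg) into a fresh model of the same class.
```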
arXiv Detail & Related papers (2024-01-22T18:27:08Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - When Does Contrastive Learning Preserve Adversarial Robustness from Pretraining to Finetuning? [99.4914671654374]
We propose AdvCL, a novel adversarial contrastive pretraining framework.
We show that AdvCL is able to enhance cross-task robustness transferability without loss of model accuracy or finetuning efficiency.
arXiv Detail & Related papers (2021-11-01T17:59:43Z) - Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL.
Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
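The two-output-layer design described above is easy to sketch: a shared sequence encoder feeding both a supervised next-item head and a Q-value head. Layer choices below are illustrative, not the paper's exact architecture.

```python
# Illustrative sketch of the two-head design the abstract describes:
# one shared encoder, one self-supervised head, one Q-learning head.
import torch
import torch.nn as nn

class TwoHeadRecommender(nn.Module):
    def __init__(self, n_items, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_items, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.ss_head = nn.Linear(hidden, n_items)  # next-item logits
        self.q_head = nn.Linear(hidden, n_items)   # Q-values per item

    def forward(self, item_seq):
        h, _ = self.encoder(self.embed(item_seq))
        state = h[:, -1]  # last hidden state as the user state
        return self.ss_head(state), self.q_head(state)
```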
arXiv Detail & Related papers (2020-06-10T11:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.