Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
- URL: http://arxiv.org/abs/2504.05599v1
- Date: Tue, 08 Apr 2025 01:19:20 GMT
- Title: Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
- Authors: Yi Peng, Chris, Xiaokun Wang, Yichen Wei, Jiangbo Pei, Weijie Qiu, Ai Jian, Yunzhuo Hao, Jiachun Pan, Tianyidan Xie, Li Ge, Rongxian Zhuang, Xuchen Song, Yang Liu, Yahui Zhou
- Abstract summary: We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. We propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to enhance cross-modal integration efficiency. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista.
- Score: 16.183329458166618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Skywork R1V, a multimodal reasoning model extending an R1-series large language model (LLM) to visual modalities via an efficient multimodal transfer method. Leveraging a lightweight visual projector, Skywork R1V facilitates seamless multimodal adaptation without necessitating retraining of either the foundational language model or the vision encoder. To strengthen visual-text alignment, we propose a hybrid optimization strategy that combines Iterative Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), significantly enhancing cross-modal integration efficiency. Additionally, we introduce an adaptive-length Chain-of-Thought distillation approach for reasoning data generation. This approach dynamically optimizes reasoning chain lengths, thereby improving inference efficiency and preventing the overthinking that comes with excessively long reasoning chains. Empirical evaluations demonstrate that Skywork R1V, with only 38B parameters, delivers competitive performance, achieving a score of 69.0 on the MMMU benchmark and 67.5 on MathVista. Meanwhile, it maintains robust textual reasoning performance, evidenced by impressive scores of 72.0 on AIME and 94.0 on MATH500. The Skywork R1V model weights have been publicly released to promote openness and reproducibility.
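The abstract leans on GRPO without unpacking it. As a reference point, below is a minimal sketch of the group-relative advantage that distinguishes GRPO from critic-based PPO: each sampled response is scored against its own sampling group rather than against a learned value network. The binary reward scheme and function name are illustrative assumptions, not Skywork R1V's actual code.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its own sampling group (no learned critic)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 1.0
    sigma = sigma if sigma > 0 else 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

# For one prompt, sample a group of G responses, score each one (here: 1.0
# if the final answer verifies as correct, else 0.0), and weight each
# response's log-likelihood gradient by its group-relative advantage.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(rewards))
```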
Related papers
- MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization [74.04867639197445]
MiroMind-M1 is a set of fully open-source RLMs built on the Qwen-2.5 backbone.
Our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems.
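The RLVR stage presupposes a programmatic verifier in place of a learned reward model. Below is a minimal sketch of such a verifiable reward, assuming an exact-match check on the final \boxed{} answer; the paper's actual verifier is not reproduced here.

```python
import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    r"""Return 1.0 iff the last \boxed{...} in the completion matches the gold answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}.", "42"))  # -> 1.0
```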
arXiv Detail & Related papers (2025-07-19T16:21:23Z)
- Skywork-R1V3 Technical Report [14.952041273882639]
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM).
Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models to visual tasks.
We introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection.
arXiv Detail & Related papers (2025-07-08T16:47:16Z)
- Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought [196.74837065805488]
Hunyuan-TurboS is a large hybrid Transformer-Mamba Mixture of Experts model.
It balances high performance and efficiency, offering substantial capabilities at lower inference costs.
arXiv Detail & Related papers (2025-05-21T12:11:53Z)
- Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning [17.233735911531117]
We present Skywork R1V2, a next-generation multimodal reasoning model.
At its core, R1V2 introduces a hybrid reinforcement learning paradigm.
arXiv Detail & Related papers (2025-04-23T12:24:10Z)
- Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end.
APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations.
A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
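To make the spawn()/join() abstraction concrete, here is a toy analogy in ordinary Python concurrency; in APR these are learned actions over model inference threads, so the worker function and orchestration below are illustrative assumptions, not the paper's mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

def explore(subproblem: str) -> str:
    # Stand-in for a child inference thread decoding one reasoning branch.
    return f"partial result for {subproblem!r}"

def parent_reasoner(subproblems: list[str]) -> str:
    with ThreadPoolExecutor() as pool:
        children = [pool.submit(explore, s) for s in subproblems]  # spawn()
        results = [c.result() for c in children]                   # join()
    return " | ".join(results)  # the parent thread integrates child outputs

print(parent_reasoner(["case n even", "case n odd"]))
```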
arXiv Detail & Related papers (2025-04-21T22:29:02Z)
- ReasoningV: Efficient Verilog Code Generation with Adaptive Hybrid Reasoning Model [7.798551697095774]
ReasoningV is a novel model that integrates trained intrinsic capabilities with dynamic inference adaptation for Verilog code generation.
Our framework introduces three complementary innovations, the first being ReasoningV-5K, a high-quality dataset of 5,000 functionally verified instances with reasoning paths created through multi-dimensional filtering of PyraNet samples.
Experimental results demonstrate ReasoningV's effectiveness with a pass@1 accuracy of 57.8% on VerilogEval-human.
arXiv Detail & Related papers (2025-04-20T10:16:59Z)
- VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning [55.97950660659051]
GPT-o1 and DeepSeek-R1 have demonstrated great potential in solving challenging problems through explicit reflection.
In this paper, we aim to enhance the slow-thinking capabilities of vision-language models using reinforcement learning.
Our model, VL-Rethinker, advances state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve 80.3%, 61.8%, and 43.9% respectively.
arXiv Detail & Related papers (2025-04-10T17:41:56Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs using self-generated multimodal knowledge.
We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning.
Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [24.45348222168512]
We propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability.
Our model achieves an average improvement of approximately 6% across various multimodal math reasoning benchmarks.
Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1.
arXiv Detail & Related papers (2025-03-09T20:06:45Z)
- Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization [19.37373012848517]
Large Vision Language Models (VLMs) are prone to significant hallucinations, particularly in the form of cross-modal inconsistencies.
We introduce Re-Align, a novel alignment framework that leverages image retrieval to construct a dual-preference dataset.
We also introduce rDPO, an extension of the standard direct preference optimization that incorporates an additional visual preference objective during fine-tuning.
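Since rDPO is described as an extension of standard direct preference optimization, a minimal sketch of that base objective may help. The added visual preference term is kept as a named placeholder because the paper's exact formulation is not given in this summary.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO: reward the chosen response relative to a frozen reference."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def rdpo_loss(logp_c, logp_r, ref_c, ref_r, visual_pref_loss, lam=1.0, beta=0.1):
    # `visual_pref_loss` is a hypothetical placeholder for Re-Align's added
    # visual preference objective, whose exact form is not reproduced here.
    return dpo_loss(logp_c, logp_r, ref_c, ref_r, beta) + lam * visual_pref_loss

lp_c, lp_r = torch.tensor([-1.0]), torch.tensor([-2.0])
rf_c, rf_r = torch.tensor([-1.2]), torch.tensor([-1.9])
print(rdpo_loss(lp_c, lp_r, rf_c, rf_r, visual_pref_loss=torch.tensor(0.05)))
```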
arXiv Detail & Related papers (2025-02-18T18:59:57Z)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs [84.2229964736678]
We report on the training practice of Kimi k1.5, our latest multi-modal language model trained with reinforcement learning.
Long context scaling and improved policy optimization methods are key ingredients of our approach.
Our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities.
arXiv Detail & Related papers (2025-01-22T02:48:14Z)
- Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [90.86370957353911]
Chain-of-Reasoning (CoR) is a novel unified framework that integrates multiple reasoning paradigms.
CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution.
Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs.
Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset.
We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies direct preference learning as a complementary approach to further improve model performance.
Existing direct preference learning algorithms were originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
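For reference, the SPPO paper casts each iteration as a squared-loss regression: nudge the log-ratio between the updated and current policy toward a scaled, centered estimate of the probability that a response beats the current policy. The win-rate estimates and the scale eta below are illustrative stand-ins, not values from the paper.

```python
import torch

def sppo_loss(logp_new, logp_current, win_prob, eta=1.0):
    """Push the new/current policy log-ratio toward eta * (P(win) - 1/2)."""
    target = eta * (win_prob - 0.5)
    return ((logp_new - logp_current) - target).pow(2).mean()

# Toy batch: log-probs of sampled responses under the updating policy and
# the frozen current iterate, plus estimated win rates against pi_t.
logp_new = torch.tensor([-3.1, -4.0])
logp_cur = torch.tensor([-3.0, -4.2])
win_prob = torch.tensor([0.7, 0.4])
print(sppo_loss(logp_new, logp_cur, win_prob))
```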
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Self-Supervised Visual Preference Alignment [21.552415796397206]
This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs).
We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization.
It is based on a core idea: properly designed augmentations of the image input induce the VLM to generate false but hard-negative responses, which the model learns from to produce more robust and powerful answers.
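The recipe here is data construction rather than a new loss. Below is a minimal sketch of the pair-building step the summary describes, with `vlm.generate` and `augment` as hypothetical stand-ins for the model's decoding call and the paper's image augmentations.

```python
def build_preference_pair(vlm, image, prompt, augment):
    chosen = vlm.generate(image, prompt)            # response grounded in the image
    hard_negative = augment(image)                  # carefully designed distortion
    rejected = vlm.generate(hard_negative, prompt)  # plausible but false response
    # The resulting pairs feed standard direct preference optimization.
    return {"image": image, "prompt": prompt,
            "chosen": chosen, "rejected": rejected}
```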
arXiv Detail & Related papers (2024-04-16T12:19:54Z)
- Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching [49.730741713652435]
In this paper, we propose a method that can effectively transfer the representations of a large pre-trained multimodal model into a small target model.
For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model.
To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts.
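As a reference for how a student can learn a teacher's representations across modalities, here is a hedged sketch of a similarity-matching objective: align the student's image-to-text-prompt similarity distribution with the teacher's. The temperature, KL direction, and shapes are assumptions, not the paper's exact CSM formulation.

```python
import torch
import torch.nn.functional as F

def csm_loss(student_img, student_txt, teacher_img, teacher_txt, tau=0.07):
    """Match the student's image-to-prompt similarity distribution to the teacher's."""
    # Rows index batch images; columns index text prompts (e.g. class prompts).
    s_sim = student_img @ student_txt.t() / tau
    t_sim = teacher_img @ teacher_txt.t() / tau
    return F.kl_div(F.log_softmax(s_sim, dim=-1),
                    F.softmax(t_sim, dim=-1), reduction="batchmean")

# Toy shapes: 4 images, 8 prompts; student and teacher widths may differ.
s_i, s_t = F.normalize(torch.randn(4, 16), dim=-1), F.normalize(torch.randn(8, 16), dim=-1)
t_i, t_t = F.normalize(torch.randn(4, 32), dim=-1), F.normalize(torch.randn(8, 32), dim=-1)
print(csm_loss(s_i, s_t, t_i, t_t))
```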
arXiv Detail & Related papers (2023-01-07T17:24:11Z)