Improved Visual-Spatial Reasoning via R1-Zero-Like Training
- URL: http://arxiv.org/abs/2504.00883v2
- Date: Mon, 14 Apr 2025 20:12:57 GMT
- Title: Improved Visual-Spatial Reasoning via R1-Zero-Like Training
- Authors: Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng
- Abstract summary: Video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of multi-modal large language models (MLLMs). This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training.
- Score: 18.61987706753493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and find that vsGRPO clearly outperforms both. The code and dataset will be available soon.
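For concreteness, the training objective referenced above (group-relative advantages plus a retained KL penalty toward the reference policy) can be sketched as follows. This is a minimal illustration assuming sequence-level log-probabilities and illustrative hyperparameter values; it is not the authors' released code.

```python
import torch

def grpo_loss(logps, old_logps, ref_logps, rewards, clip_eps=0.2, beta=0.04):
    """Sketch of a GRPO objective for one group of G rollouts of a single prompt.

    logps, old_logps, ref_logps: (G,) summed token log-probs under the current,
    behaviour, and frozen reference policies; rewards: (G,) scalar rewards.
    beta > 0 keeps the KL penalty active, as the abstract recommends.
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logps - old_logps)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty toward the reference policy (the unbiased k3-style estimator).
    kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1.0

    return -(surrogate - beta * kl).mean()

# Toy usage: 8 rollouts of one prompt with a binary accuracy reward.
g = 8
loss = grpo_loss(torch.randn(g), torch.randn(g), torch.randn(g),
                 torch.randint(0, 2, (g,)).float())
```

In practice GRPO applies the ratio and KL terms per token; the sequence-level form above only illustrates the structure of the objective.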
Related papers
- SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM [18.275547804539016]
Two-Staged history-Resampling Policy Optimization (SRPO) surpasses DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks.
We introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique to address ineffective samples.
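The summary does not spell out History Resampling; one plausible reading, sketched below as an assumption rather than the paper's exact rule, is that prompts whose recent rollouts all received the same reward carry zero group-relative advantage and are dropped from the next epoch.

```python
def resample_history(prompt_ids, last_epoch_rewards):
    """Hypothetical History-Resampling filter (assumed behavior, not SRPO's exact rule).

    last_epoch_rewards maps each prompt id to the rewards its rollouts received in
    the previous epoch. Groups with identical rewards (all solved or all failed)
    yield zero advantage, so they are treated as ineffective samples and skipped.
    """
    return [pid for pid in prompt_ids
            if len(set(last_epoch_rewards.get(pid, []))) > 1]

# Toy usage: only prompt "b" produced a mix of successes and failures, so it is kept.
print(resample_history(["a", "b"], {"a": [1, 1, 1, 1], "b": [0, 1, 0, 1]}))  # ['b']
```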
arXiv Detail & Related papers (2025-04-19T13:06:03Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning [31.014515049981817]
Rule-based reinforcement learning can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphical user interface (GUI) action prediction tasks. We introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms. Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks.
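The blurb does not define the rule-based action reward; a hypothetical form is sketched below, combining an action-type match, a click-inside-target check, and a response-format check. The components and weights are assumptions for illustration only.

```python
def gui_action_reward(pred, gold, follows_template):
    """Hypothetical rule-based reward for one predicted GUI action (assumed form).

    pred/gold: dicts such as {"action": "click", "x": 120, "y": 300};
    gold may also carry the bounding box of the target element.
    """
    r_type = 1.0 if pred.get("action") == gold.get("action") else 0.0

    # Coordinate rule: reward a click that lands inside the gold element's box.
    r_coord = 0.0
    if gold.get("action") == "click" and "bbox" in gold:
        x1, y1, x2, y2 = gold["bbox"]
        if x1 <= pred.get("x", -1) <= x2 and y1 <= pred.get("y", -1) <= y2:
            r_coord = 1.0

    # Format rule: the model followed the expected reasoning/answer template.
    r_format = 1.0 if follows_template else 0.0

    return r_type + r_coord + r_format

# Toy usage: correct action type, click inside the box, well-formed output -> 3.0
gold = {"action": "click", "bbox": (100, 280, 160, 320)}
pred = {"action": "click", "x": 120, "y": 300}
print(gui_action_reward(pred, gold, follows_template=True))
```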
arXiv Detail & Related papers (2025-03-27T15:39:30Z)
- Understanding R1-Zero-Like Training: A Critical Perspective [38.515771096651356]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z)
- Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning [26.14137626882127]
Large Vision-Language Models (LVLMs) typically follow a two-stage training paradigm of pretraining and supervised fine-tuning.
Preference optimization, derived from the language domain, has emerged as an effective post-training reinforcement strategy.
We propose Vision-R1, a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback.
arXiv Detail & Related papers (2025-03-23T10:21:14Z)
- OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement [91.88062410741833]
This study investigates whether R1-style reasoning capabilities can be successfully integrated into large vision-language models (LVLMs).
We consider an approach that iteratively leverages supervised fine-tuning (SFT) on lightweight training data and Reinforcement Learning (RL) to further improve model generalization.
OpenVLThinker, an LVLM exhibiting consistently improved reasoning performance on challenging benchmarks such as MathVista, MathVerse, and MathVision, demonstrates the potential of our strategy for robust vision-language reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [24.45348222168512]
We propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Our model achieves an average improvement of about 6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI o1.
arXiv Detail & Related papers (2025-03-09T20:06:45Z)
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26x. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling.
We present T1, which scales reinforcement learning by encouraging exploration, and we analyze its inference-scaling behavior.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
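To make the iterative policy updates concrete, here is a sketch of the squared-loss update commonly associated with SPPO-style self-play; the exact form and the eta value are assumptions based on one reading of the method, not a verbatim reproduction.

```python
import torch

def sppo_loss(logp_theta, logp_prev, win_rate, eta=1e3):
    """Sketch of one SPPO-style self-play update (assumed form).

    logp_theta / logp_prev: (B,) log-probs of sampled responses under the model
    being trained and under the previous iterate pi_t; win_rate: (B,) estimated
    probability that each response beats pi_t's own samples under a preference
    model. Driving log(pi_theta / pi_t) toward eta * (win_rate - 1/2) mimics the
    multiplicative-weights update that approximates the game's Nash equilibrium.
    """
    log_ratio = logp_theta - logp_prev
    target = eta * (win_rate - 0.5)   # eta is an illustrative step-size choice
    return ((log_ratio - target) ** 2).mean()

# Toy usage: 4 sampled responses with estimated win rates against the previous policy.
loss = sppo_loss(torch.randn(4), torch.randn(4),
                 torch.tensor([0.8, 0.3, 0.6, 0.5]))
```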
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: amplifying the reward signal learned during alignment training.
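The extrapolation itself is simple enough to sketch: treat the aligned checkpoint as a partial move away from the SFT checkpoint and step further in that direction. The helper and alpha value below are an assumed weight-space form for illustration, not the authors' code.

```python
import torch

def expo_extrapolate(sft_state, aligned_state, alpha=0.3):
    """Weak-to-strong extrapolation in weight space (ExPO-style sketch).

    Moves past the aligned (DPO/RLHF) weights along the SFT -> aligned direction:
    theta = aligned + alpha * (aligned - sft), which amplifies the reward signal
    learned during alignment training.
    """
    return {name: aligned_state[name] + alpha * (aligned_state[name] - sft_state[name])
            for name in aligned_state}

# Toy usage with two tiny "models" stored as state dicts.
sft = {"w": torch.tensor([1.0, 2.0])}
dpo = {"w": torch.tensor([1.5, 2.5])}
print(expo_extrapolate(sft, dpo))  # w becomes [1.65, 2.65]
```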
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
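A rough sketch of the FastV idea follows: after an early layer, rank image tokens by how much attention they receive and drop the rest for subsequent layers. The ranking criterion and keep ratio here are assumptions for illustration, not the paper's exact configuration.

```python
import torch

def prune_visual_tokens(hidden, attn, visual_idx, keep_ratio=0.5):
    """FastV-style visual-token pruning after an early layer (details assumed).

    hidden: (T, D) hidden states; attn: (H, T, T) attention weights from the chosen
    layer; visual_idx: indices of the image tokens. Image tokens receiving the least
    attention (averaged over heads and query positions) are dropped downstream.
    """
    received = attn.mean(dim=0).mean(dim=0)        # (T,) attention each token receives
    vis_scores = received[visual_idx]

    k = max(1, int(keep_ratio * len(visual_idx)))
    keep_vis = visual_idx[vis_scores.topk(k).indices]

    # Keep every text token plus the top-scoring visual tokens, in original order.
    mask = torch.ones(hidden.size(0), dtype=torch.bool)
    mask[visual_idx] = False
    mask[keep_vis] = True
    return hidden[mask]

# Toy usage: 10 tokens where indices 2..7 are visual, 4 attention heads.
hidden = torch.randn(10, 16)
attn = torch.softmax(torch.randn(4, 10, 10), dim=-1)
print(prune_visual_tokens(hidden, attn, torch.arange(2, 8)).shape)  # (7, 16)
```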
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.