DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
- URL: http://arxiv.org/abs/2506.07464v4
- Date: Fri, 31 Oct 2025 12:13:12 GMT
- Title: DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
- Authors: Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim
- Abstract summary: Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. In this paper, we explore GRPO and identify two problems that hinder effective learning. We propose DeepVideo-R1, a video large language model trained with Reg-GRPO and difficulty-aware data augmentation.
- Score: 37.07375927420007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has been less studied. In this paper, we explore GRPO and identify two problems that hinder effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions; by aligning the model directly with the advantage values, it gives explicit guidance to prefer better responses (a minimal sketch of this idea follows below). The difficulty-aware data augmentation strategy augments input prompts and videos so that samples land at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
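As a rough illustration of the regression reformulation described above, the sketch below computes group-normalized advantages and regresses a policy-derived prediction onto them with a plain MSE loss, with no clipping or min safeguards. Using the current-vs-reference log-likelihood ratio as the advantage prediction is an assumption made here for illustration; the paper derives the exact regression form, so treat this as a minimal sketch rather than the authors' implementation.

```python
import torch

def group_normalized_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: z-score the rewards within one rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def reg_grpo_loss(logp_cur: torch.Tensor,
                  logp_ref: torch.Tensor,
                  rewards: torch.Tensor) -> torch.Tensor:
    """Regression-style GRPO loss (sketch, not the paper's exact form).

    logp_cur / logp_ref: (G,) sequence log-probs of G rollouts under the
    current and reference policies; rewards: (G,) scalar rewards.
    """
    adv = group_normalized_advantages(rewards)
    # Assumed advantage "prediction": the log-likelihood ratio (hypothetical).
    pred = logp_cur - logp_ref.detach()
    # Plain regression onto the advantage -- no clip(), no min() safeguards.
    return torch.mean((pred - adv) ** 2)
```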
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts (see the sketch after this entry). Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
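One hedged reading of "dynamic self-conditioning through model-generated drafts": first sample a draft answer, then append it to the prompt and run the usual GRPO rollouts on the draft-conditioned prompt. `policy.generate`, `reward_fn`, `grpo_update`, and the prompt template below are hypothetical placeholders, not the paper's API.

```python
def igrpo_step(policy, prompt, reward_fn, grpo_update, num_rollouts=8):
    """Two-stage iGRPO step (illustrative sketch, assumed structure).

    Stage 1: the model produces a draft solution for the prompt.
    Stage 2: GRPO runs on the prompt conditioned on that draft.
    All callables are hypothetical helpers supplied by the caller.
    """
    draft = policy.generate(prompt)  # stage 1: self-generated draft
    conditioned = f"{prompt}\n\nDraft attempt:\n{draft}\n\nRevise and answer:"
    rollouts = [policy.generate(conditioned) for _ in range(num_rollouts)]
    rewards = [reward_fn(prompt, r) for r in rollouts]
    grpo_update(policy, conditioned, rollouts, rewards)  # stage 2: GRPO update
```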
- TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment [28.18756041538092]
We present TAGRPO, a robust framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization.
arXiv Detail & Related papers (2026-01-09T11:15:27Z)
- Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting [0.7365798659670144]
Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models. It suffers from a critical vulnerability: the Think-Answer Mismatch, where noisy reward signals corrupt the learning process. We propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training (a hedged sketch follows this entry).
arXiv Detail & Related papers (2025-08-08T01:24:06Z)
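The summary does not give the derived weighting, so the sketch below shows only one plausible instantiation: down-weight a rollout's advantage in proportion to an estimated reward-noise variance. The inverse-variance form and the assumption that noise estimates are available are both mine, not the paper's.

```python
import torch

def noise_aware_advantages(rewards: torch.Tensor,
                           noise_var: torch.Tensor) -> torch.Tensor:
    """Down-weight advantages by estimated reward noise (illustrative guess).

    rewards: (G,) group rewards; noise_var: (G,) per-rollout estimates of
    reward-noise variance (how to estimate these is the paper's contribution;
    here they are simply given). The inverse-variance weighting is an assumed
    form, not the paper's derived optimum.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    weights = 1.0 / (1.0 + noise_var)   # noisier rewards -> smaller weight
    return weights * adv
```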
- EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity [7.818698554631196]
The Group Relative Policy Optimization (GRPO) algorithm relies on sparse reward rules, leading to the advantage collapse problem. We propose the EDGE-GRPO algorithm, which adopts Entropy-Driven Advantage and Guided Error Correction to effectively mitigate the problem of advantage collapse.
arXiv Detail & Related papers (2025-07-29T14:23:58Z)
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains (see the sketch after this entry). Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
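A skeleton showing how the three stabilizers named above might fit together in one training loop; the arrangement, the `ppo_step` helper, and all default values are assumptions for illustration, not the paper's hyperparameters.

```python
import copy

def prolonged_rl_training(policy, ref_policy, steps, ppo_step,
                          kl_coef=0.01, clip_ratio=0.2, reset_every=1000):
    """Prolonged RL loop with the three stabilizers (assumed arrangement).

    `ppo_step` is a hypothetical PPO/GRPO update that accepts a KL
    coefficient and a clipping ratio.
    """
    for step in range(steps):
        ppo_step(policy, ref_policy, kl_coef=kl_coef, clip_ratio=clip_ratio)
        if (step + 1) % reset_every == 0:
            # Periodic reference reset: re-anchor the KL penalty to the
            # current policy so regularization does not freeze progress.
            ref_policy = copy.deepcopy(policy)
    return policy
```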
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals (a sketch follows this entry). We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
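A sketch of one way to combine a discrete semantic signal with a continuous temporal one. Using exact-match accuracy for the discrete part, temporal IoU for the continuous part, and an equal weighting are all illustrative assumptions; the summary does not specify the composition.

```python
def temporal_iou(pred: tuple, gold: tuple) -> float:
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def dual_reward(pred_answer, gold_answer, pred_span, gold_span, w=0.5):
    """Discrete + continuous reward mix (assumed form and weighting)."""
    discrete = 1.0 if pred_answer == gold_answer else 0.0   # semantic signal
    continuous = temporal_iou(pred_span, gold_span)         # temporal signal
    return w * discrete + (1 - w) * continuous
```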
- ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z)
- Reinforcing Video Reasoning with Focused Thinking [65.85683941058916]
A novel framework enhances visual reasoning with focused thinking and dense reward granularity. We employ a token weighting mechanism that prioritizes tokens with high informational density (see the sketch after this entry). We reformulate RL training by shifting from single-choice to multi-choice QA tasks.
arXiv Detail & Related papers (2025-05-30T15:42:19Z)
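A sketch of token-weighted policy-gradient loss. Token surprisal (-log p) is used here as an assumed proxy for "informational density"; the paper's actual density measure may differ, so this only shows where such weights would enter the loss.

```python
import torch

def token_weighted_pg_loss(logps: torch.Tensor, advantage: float) -> torch.Tensor:
    """Per-token policy-gradient loss with informational-density weights.

    logps: (T,) log-probs of the sampled tokens. Each token is weighted by
    its surprisal (-logp), normalized to mean 1 -- an assumed proxy, not the
    paper's definition of informational density.
    """
    surprisal = -logps.detach()
    weights = surprisal / (surprisal.mean() + 1e-8)  # high-info tokens up-weighted
    return -(weights * logps).sum() * advantage
```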
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
- Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO [22.00487909203855]
Group Relative Policy Optimization fails to update a policy when all responses within a group are incorrect. This limitation underscores a key gap between artificial and human intelligence. We introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups (a sketch of the failure mode follows this entry).
arXiv Detail & Related papers (2025-05-16T18:02:05Z)
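The failure mode above is easy to reproduce: when every rollout in a group gets the same reward, mean-centering makes all advantages zero and no gradient flows. Injecting one guided response with positive reward, as sketched below in the spirit of this entry, restores a learning signal; the injection mechanism itself is a hypothetical stand-in.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# All-negative group: every rollout gets reward 0, so mean-centering yields
# zero advantages and the policy receives no learning signal.
print(grpo_advantages(torch.zeros(8)))      # tensor of zeros -> no update

# Assumed mitigation: inject one guided (e.g., teacher-corrected) response
# with positive reward to restore diversity within the group.
rewards = torch.zeros(8)
rewards[0] = 1.0                            # hypothetical guided response
print(grpo_advantages(rewards))             # non-zero advantages again
```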
- A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples (see the sketch after this entry).
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
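The filtering idea is simple enough to sketch directly: drop prompts whose rollout groups are all-correct or all-incorrect, since such groups carry no contrastive signal under group-relative training. The `(prompt, rollouts, rewards)` data layout is an assumed convention.

```python
def reinforce_rej_filter(groups):
    """Keep only prompts whose rollout groups are mixed (sketch of the idea).

    groups: list of (prompt, rollouts, rewards) tuples with binary rewards.
    Entirely correct or entirely incorrect groups are dropped before the
    policy-gradient update.
    """
    kept = []
    for prompt, rollouts, rewards in groups:
        if 0 < sum(rewards) < len(rewards):   # mixed group: keep
            kept.append((prompt, rollouts, rewards))
    return kept
```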
- GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models [0.17265013728931003]
GRPO-LEAD is a suite of novel enhancements tailored for mathematical reasoning. It introduces (1) a length-dependent accuracy reward to encourage concise and precise solutions, (2) an explicit penalty mechanism for incorrect answers to sharpen decision boundaries, and (3) a difficulty-aware advantage reweighting strategy that amplifies learning signals for challenging problems (a sketch follows this entry).
arXiv Detail & Related papers (2025-04-13T19:07:45Z)
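One way the three ingredients above could compose into a single reward; every constant and functional form here (the exponential length decay, the penalty value, the difficulty multiplier) is an assumption for illustration, not the paper's formula.

```python
import math

def lead_style_reward(correct: bool, length: int,
                      target_len: int = 512, wrong_penalty: float = 0.5,
                      difficulty: float = 0.5) -> float:
    """Illustrative GRPO-LEAD-style reward (assumed forms throughout).

    (1) correct answers earn more when shorter than a target length,
    (2) incorrect answers receive an explicit penalty,
    (3) the signal is scaled up for harder problems (difficulty in [0, 1]).
    """
    if correct:
        base = math.exp(-max(0, length - target_len) / target_len)  # length-dependent
    else:
        base = -wrong_penalty                                       # explicit penalty
    return base * (1.0 + difficulty)                                # difficulty reweighting
```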
- Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding [57.26400319795876]
Temporal Video Grounding (TVG) is a core challenge in long-form video understanding. Recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning. We propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning.
arXiv Detail & Related papers (2025-03-17T17:04:20Z)