TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
- URL: http://arxiv.org/abs/2509.18056v2
- Date: Thu, 25 Sep 2025 14:28:56 GMT
- Title: TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
- Authors: Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng,
- Abstract summary: This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks.<n>We show that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets.
- Score: 67.55973229034319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
Related papers
- Agile Reinforcement Learning through Separable Neural Architecture [0.8577671031243427]
This work introduces SPAN, a novel function approximation approach to deep reinforcement learning.<n>SPAN achieves a 30-50% improvement in sample efficiency and 1.3-9 times higher success rates across benchmarks compared to baselines.
arXiv Detail & Related papers (2026-01-30T17:47:36Z) - Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts.<n>We analyze how the sampling policy's coverage evolves throughout on-policy training.<n>We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z) - Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse.<n>$R2VPO$ achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
arXiv Detail & Related papers (2026-01-06T14:01:42Z) - TS-DP: Reinforcement Speculative Decoding For Temporal Adaptive Diffusion Policy Acceleration [64.32072516882947]
Diffusion Policy excels in embodied control but suffers from high inference latency and computational cost.<n>We propose Temporal-aware Reinforcement-based Speculative Diffusion Policy (TS-DP)<n>TS-DP achieves up to 4.17 times faster inference with over 94% accepted drafts, reaching an inference frequency of 25 Hz.
arXiv Detail & Related papers (2025-12-13T07:53:14Z) - Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization [13.475938754147625]
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks.<n>Agentic Reinforcement Learning (Agentic RL) optimize such models over full tool-interaction trajectories.<n>Two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence.<n>We propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO).
arXiv Detail & Related papers (2025-12-08T11:59:25Z) - Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning [18.760525047404098]
Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research.<n>Standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training.<n>We introduce staggered resets, a simple yet effective technique where environments are reset at varied points within the task horizon.
arXiv Detail & Related papers (2025-11-26T03:20:08Z) - Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning [45.51804571136028]
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs)<n>We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages.<n>SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training.
arXiv Detail & Related papers (2025-10-05T07:22:54Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function.<n>$A$*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks.<n>It reduces training time by up to 2$times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation [28.63391989014238]
Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time.<n>We propose a model-based algorithm that achieves both sample and computational efficiency.<n>We show that a near-optimal policy can be learned with a suboptimality gap of $tildeO(sqrtd_mathcalR + d_mathcalFN-1/2)$ using $N$ measurements.
arXiv Detail & Related papers (2025-05-20T18:37:51Z) - LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models [23.218237408724676]
We propose LoRA-TTT, a novel Test-Time Training (TTT) method for vision-language models (VLMs)<n>By introducing LoRA and updating only its parameters during test time, our method offers a simple yet effective TTT approach.<n>Our method can adapt to diverse domains by combining these two losses, without increasing memory consumption or runtime.
arXiv Detail & Related papers (2025-02-04T07:40:26Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.<n>In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.<n>We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.