Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
- URL: http://arxiv.org/abs/2507.06485v1
- Date: Wed, 09 Jul 2025 02:06:13 GMT
- Title: Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
- Authors: Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
- Abstract summary: Video-RTS is a new approach to improve video reasoning capability with drastically improved data efficiency. We employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy.
- Score: 65.86184845073075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite advances in reinforcement learning (RL)-based video reasoning with large language models (LLMs), data collection and finetuning remain significant challenges. These methods often rely on large-scale supervised fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT) annotations, making them costly and hard to scale. To address this, we present Video-RTS, a new approach to improve video reasoning capability with drastically improved data efficiency by combining data-efficient RL with a video-adaptive test-time scaling (TTS) strategy. Based on observations about the data scaling of RL samples, we skip the resource-intensive SFT step and employ efficient pure-RL training with output-based rewards, requiring no additional annotations or extensive fine-tuning. Furthermore, to utilize computational resources more efficiently, we introduce a sparse-to-dense video TTS strategy that improves inference by iteratively adding frames based on output consistency. We validate our approach on multiple video reasoning benchmarks, showing that Video-RTS surpasses existing video reasoning models by an average of 2.4% in accuracy while using only 3.6% of the training samples. For example, Video-RTS achieves a 4.2% improvement on Video-Holmes, a recent and challenging video reasoning benchmark, and a 2.6% improvement on MMVU. Notably, our pure-RL training and adaptive video TTS offer complementary strengths, enabling Video-RTS's strong reasoning performance.
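The two ingredients described in the abstract, output-based rewards for pure-RL training and a sparse-to-dense TTS loop driven by output consistency, can be pictured with a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the `model.answer(frames, question)` interface, the frame budgets, the number of sampled answers, and the agreement threshold below are all hypothetical.

```python
# Hedged sketch of (1) an output-based reward and (2) a sparse-to-dense
# test-time scaling loop that adds frames until sampled answers agree.
# All names and hyperparameters are illustrative assumptions.
from collections import Counter


def output_reward(prediction: str, ground_truth: str) -> float:
    """Output-based reward for pure-RL training: 1.0 iff the final answer matches."""
    return float(prediction.strip().lower() == ground_truth.strip().lower())


def uniform_sample(video_frames, k):
    """Pick k frames spread roughly evenly across the video."""
    step = max(len(video_frames) // k, 1)
    return video_frames[::step][:k]


def sparse_to_dense_tts(model, video_frames, question,
                        budgets=(8, 16, 32, 64), num_samples=5, agree_ratio=0.8):
    """Start from a sparse frame set; add frames until sampled outputs are consistent."""
    top_answer = None
    for k in budgets:
        frames = uniform_sample(video_frames, min(k, len(video_frames)))
        # Sample several answers for the same input to measure self-consistency.
        answers = [model.answer(frames, question) for _ in range(num_samples)]
        top_answer, count = Counter(answers).most_common(1)[0]
        if count / num_samples >= agree_ratio:
            return top_answer  # outputs agree: stop adding frames
    return top_answer  # fall back to the majority answer at the densest budget
```

In this sketch the consistency check is a simple majority-vote threshold; the paper's exact stopping criterion and frame schedule may differ.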
Related papers
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs. (A minimal sketch of such a dual reward appears after this list.)
arXiv Detail & Related papers (2025-06-02T17:28:26Z) - ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.<n>ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error.<n>ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z) - VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning [33.170426237654596]
VIDEORFT is a novel approach to cultivate human-like video reasoning capabilities in MLLMs.<n>It follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization.<n>It achieves state-of-the-art performance on six video reasoning benchmarks.
arXiv Detail & Related papers (2025-05-18T14:14:35Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.<n>It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.<n>Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT)<n>Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - Video-R1: Reinforcing Video Reasoning in MLLMs [30.13366332687375]
Video-R1 is the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning.<n>We first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning.<n>We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
arXiv Detail & Related papers (2025-03-27T17:59:51Z) - Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding [57.26400319795876]
Temporal Video Grounding (TVG) is a core challenge in long-form video understanding.<n>Recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning.<n>We propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning.
arXiv Detail & Related papers (2025-03-17T17:04:20Z) - Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation [98.92677830223786]
This work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective.<n>We propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data.<n>Our proposed method achieves performance comparable to or even superior to baselines trained with many more samples.
arXiv Detail & Related papers (2024-11-29T18:59:54Z) - EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training [15.684865589513597]
We propose an efficient patch sampling method named EPS for video SR network overfitting.
Our method reduces the number of patches used for training to 4%-25% of the original, depending on the resolution and the number of clusters.
Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
arXiv Detail & Related papers (2024-11-25T12:01:57Z) - Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)