Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
- URL: http://arxiv.org/abs/2512.00961v1
- Date: Sun, 30 Nov 2025 16:22:27 GMT
- Title: Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning
- Authors: Qi Wang, Mian Wu, Yuyang Zhang, Mingqi Yuan, Wenyao Zhang, Haoxiang You, Yunbo Wang, Xin Jin, Xiaokang Yang, Wenjun Zeng
- Abstract summary: We exploit off-the-shelf video diffusion models pretrained on large-scale video datasets. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as frame-level reward.
- Score: 58.33560203572211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging, and they may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the generated goal videos. To enable more fine-grained goal achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as the frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
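As a rough illustration of the two reward types described in the abstract, the sketch below scores a trajectory clip against a generated goal video (video-level) and scores a state against a selected goal frame (frame-level). The encoder, the forward-backward probability model, and all shapes are toy stand-ins, not the paper's implementation:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flat feature vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def video_level_reward(encode_video, agent_clip, goal_clip):
    """Video-level reward: alignment between latent encodings of the
    agent's trajectory clip and the diffusion-generated goal video."""
    return cosine(encode_video(agent_clip), encode_video(goal_clip))

def frame_level_reward(fb_prob, state, action, goal_state):
    """Frame-level reward: probability, under a learned forward-backward
    representation, of visiting the goal frame from (state, action)."""
    return fb_prob(state, action, goal_state)

# Toy stand-ins for the learned models (placeholders, not the paper's code).
encode_video = lambda clip: clip.mean(axis=0)                   # fake encoder
fb_prob = lambda s, a, g: float(np.exp(-np.linalg.norm(s - g)))

rng = np.random.default_rng(0)
agent_clip = rng.normal(size=(16, 32))           # 16 frames, 32-d features
goal_clip = agent_clip + 0.01 * rng.normal(size=(16, 32))
r_video = video_level_reward(encode_video, agent_clip, goal_clip)
r_frame = frame_level_reward(fb_prob, agent_clip[-1], None, goal_clip[-1])
```

Because the toy goal clip is a lightly perturbed copy of the agent clip, both scores come out near their maximum; real latents from the finetuned diffusion encoder would behave less trivially.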
Related papers
- Reinforcement Learning with Inverse Rewards for World Model Post-training [29.19830208692156]
We propose Reinforcement Learning with Inverse Rewards to improve action-following in video world models.
RLIR derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model.
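The inverse-reward recipe above could be sketched as follows; the inverse dynamics model and the exponential match score are illustrative assumptions, not RLIR's actual components:

```python
import numpy as np

def rlir_reward(inverse_dynamics, generated_frames, input_actions):
    """RLIR-style verifiable reward (sketch): recover actions from the
    generated video with an inverse dynamics model (IDM), then score how
    closely they match the actions the world model was conditioned on.
    `inverse_dynamics` is a hypothetical stand-in for the learned IDM."""
    recovered = np.stack([
        inverse_dynamics(f0, f1)
        for f0, f1 in zip(generated_frames[:-1], generated_frames[1:])
    ])
    errors = np.linalg.norm(recovered - input_actions, axis=-1)
    return float(np.exp(-errors.mean()))  # 1.0 = perfect action-following

# Toy check: an IDM that reads the action straight off the frame delta.
idm = lambda f0, f1: f1 - f0
actions = np.full((4, 2), 0.5)
frames = np.cumsum(np.vstack([np.zeros((1, 2)), actions]), axis=0)
r = rlir_reward(idm, frames, actions)  # recovered actions match exactly
```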
arXiv Detail & Related papers (2025-09-28T16:27:47Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding.
ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error.
ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
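Accuracy-as-reward is a simple idea that can be shown in a few lines; the downstream model, frames, and answers below are toy stand-ins, not ViaRL's components:

```python
def selector_reward(qa_model, frames, select_idx, question, answer):
    """ViaRL-style trial-and-error signal (sketch): the frame selector
    is rewarded only when the downstream model answers correctly from
    the frames it picked. `qa_model` is a hypothetical stand-in for the
    downstream video-understanding model."""
    pred = qa_model([frames[i] for i in select_idx], question)
    return 1.0 if pred == answer else 0.0

# Toy downstream model: answers "red" iff a frame tagged "red" was kept.
qa_model = lambda picked, q: "red" if "red" in picked else "blue"
frames = ["blue", "blue", "red", "blue"]
r_good = selector_reward(qa_model, frames, [2], "color?", "red")    # kept it
r_bad = selector_reward(qa_model, frames, [0, 1], "color?", "red")  # missed
```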
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data [56.217490064597506]
We propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data.
We use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward.
Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.
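Folding a goal-conditioned value into the reward can be sketched as additive shaping; the value function and weighting below are illustrative assumptions, not ViVa's formulation:

```python
import numpy as np

def shaped_reward(env_reward, value_fn, obs, goal, alpha=0.5):
    """ViVa-style shaping (sketch): add a goal-conditioned value learned
    from video data to the environment reward. `value_fn` stands in for
    the intent-conditioned value function; `alpha` weights the bonus."""
    return env_reward + alpha * value_fn(obs, goal)

# Hypothetical value: higher when the observation is nearer the goal.
value_fn = lambda obs, goal: -np.linalg.norm(obs - goal)
goal = np.array([1.0, 1.0])
r_near = shaped_reward(0.0, value_fn, np.array([0.9, 0.9]), goal)
r_far = shaped_reward(0.0, value_fn, np.array([0.0, 0.0]), goal)
```

The shaped reward prefers observations closer to the goal even when the sparse environment reward is zero, which is the point of the video-trained value.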
arXiv Detail & Related papers (2025-03-23T21:24:33Z)
- RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
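Feedback-driven reward learning of this kind is commonly trained with a Bradley-Terry objective over pairwise preferences; the sketch below shows that standard loss, which may differ from RL-VLM-F's exact objective:

```python
import numpy as np

def preference_loss(r_a, r_b, label):
    """Bradley-Terry loss for reward learning from pairwise preferences,
    the standard recipe for learning rewards from VLM or human feedback
    (a sketch; the paper's exact objective may differ).
    `label` is 1 if segment A is preferred, 0 if B is."""
    p_a = 1.0 / (1.0 + np.exp(r_b - r_a))  # P(A preferred | predicted rewards)
    eps = 1e-8
    return float(-(label * np.log(p_a + eps)
                   + (1 - label) * np.log(1 - p_a + eps)))

loss_agree = preference_loss(2.0, 0.0, 1)     # reward model agrees with label
loss_contra = preference_loss(0.0, 2.0, 1)    # reward model contradicts label
```

Minimizing this loss pushes the learned reward to rank the preferred segment higher, so the preferences elicited from the vision-language model shape the reward function.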
arXiv Detail & Related papers (2024-02-06T04:06:06Z)
- Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection [51.004020874336284]
VidTFS is a training-free, open-vocabulary framework for video goal and action inference.
Our experiments demonstrate that the proposed frame selection module improves the performance of the framework significantly.
We validate the performance of the proposed VidTFS on four widely used video datasets.
arXiv Detail & Related papers (2024-01-23T03:45:05Z)
- Video Prediction Models as Rewards for Reinforcement Learning [127.53893027811027]
VIPER is an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning.
We see our work as starting point for scalable reward specification from unlabeled videos.
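The core of a VIPER-style reward is the likelihood a pretrained video prediction model assigns to the agent's transitions; the Gaussian "model" below is a toy stand-in, not the paper's video model:

```python
import numpy as np

def viper_reward(log_prob, context, next_frame):
    """VIPER-style reward (sketch): score a transition by the log-
    likelihood a pretrained video prediction model assigns to the
    agent's next frame given the preceding context frames.
    `log_prob` stands in for the pretrained model's conditional density."""
    return log_prob(next_frame, context)

# Toy model: Gaussian centered on the last context frame (unit variance).
def gaussian_logp(frame, context):
    mu = context[-1]
    d = frame - mu
    return float(-0.5 * d @ d - 0.5 * len(d) * np.log(2 * np.pi))

ctx = np.zeros((3, 4))                                  # 3 context frames, 4-d
r_expected = viper_reward(gaussian_logp, ctx, np.zeros(4))      # on-manifold
r_surprise = viper_reward(gaussian_logp, ctx, 5 * np.ones(4))   # off-manifold
```

Transitions that look like the expert videos the model was trained on score higher, which is what makes the signal action-free: no reward labels or action annotations are needed.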
arXiv Detail & Related papers (2023-05-23T17:59:33Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
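The three key-frame criteria named above can be given simple cosine-similarity formulations; the exact definitions and the feature extractor here are illustrative assumptions, not the paper's:

```python
import numpy as np

def keyframe_scores(feats):
    """Sketch of the three criteria: local dissimilarity (change from
    temporal neighbors), global consistency (similarity to the video
    mean), and uniqueness (dissimilarity to all other frames)."""
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = feats @ feats.T                              # pairwise cosine
    n = len(feats)
    local = np.zeros(n)                                # endpoints left at 0
    mid = np.arange(1, n - 1)
    local[mid] = 1 - 0.5 * (sim[mid, mid - 1] + sim[mid, mid + 1])
    mean = feats.mean(axis=0)
    consistency = feats @ (mean / (np.linalg.norm(mean) + 1e-8))
    uniqueness = 1 - (sim.sum(axis=1) - 1) / (n - 1)   # exclude self-sim
    return local + consistency + uniqueness            # combined importance

rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 16))                       # 8 frames, 16-d
scores = keyframe_scores(feats)
```

Frames scoring high on the combined criterion would be the candidates for the summary; the paper additionally refines the features with a contrastively learned projection.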
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
- Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
We experimentally show improved expected return on out-of-distribution goals, while still allowing goals with expressive structure to be specified.
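A discretizing bottleneck of this kind can be illustrated with a plain vector-quantization step; the codebook and shapes below are toy assumptions, not the paper's architecture:

```python
import numpy as np

def discretize_goal(goal, codebook):
    """Discretizing-bottleneck sketch: snap a continuous goal embedding
    to its nearest codebook vector (a simple vector-quantization step),
    so the policy conditions on a discrete code instead of a raw
    continuous goal."""
    dists = np.linalg.norm(codebook - goal, axis=1)
    idx = int(dists.argmin())
    return idx, codebook[idx]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
idx, code = discretize_goal(np.array([0.9, 1.1]), codebook)
```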
arXiv Detail & Related papers (2022-11-01T03:31:43Z)
- Learning Goals from Failure [30.071336708348472]
We introduce a framework that predicts the goals behind observable human action in video.
Motivated by evidence in developmental psychology, we leverage video of unintentional action to learn video representations of goals without direct supervision.
arXiv Detail & Related papers (2020-06-28T17:16:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.