GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
- URL: http://arxiv.org/abs/2602.05202v1
- Date: Thu, 05 Feb 2026 01:54:01 GMT
- Title: GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
- Authors: Shivanshu Shekhar, Uttaran Bhattacharya, Raghavendra Addanki, Mehrab Tanjim, Somdeb Sarkhel, Tong Zhang
- Abstract summary: Video generative models, which are inherently designed to model temporal structure, can be repurposed as reward models. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations.
- Score: 18.51125967961176
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Aligning video generative models with human preferences remains challenging: current approaches rely on Vision-Language Models (VLMs) for reward modeling, but these models struggle to capture subtle temporal dynamics. We propose a fundamentally different approach: repurposing video generative models, which are inherently designed to model temporal structure, as reward models. We present the Generative-Transformer-based Self-Supervised Video Judge (GT-SVJ), a novel evaluation model that transforms state-of-the-art video generation models into powerful temporally-aware reward models. Our key insight is that generative models can be reformulated as energy-based models (EBMs) that assign low energy to high-quality videos and high energy to degraded ones, enabling them to discriminate video quality with remarkable precision when trained via contrastive objectives. To prevent the model from exploiting superficial differences between real and generated videos, we design challenging synthetic negative videos through controlled latent-space perturbations: temporal slicing, feature swapping, and frame shuffling, which simulate realistic but subtle visual degradations. This forces the model to learn meaningful spatiotemporal features rather than trivial artifacts. GT-SVJ achieves state-of-the-art performance on GenAI-Bench and MonteBench using only 30K human annotations: $6\times$ to $65\times$ fewer than existing VLM-based approaches.
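The abstract describes the training signal concretely enough to sketch: latents of real videos should receive lower energy than latents corrupted by temporal slicing, feature swapping, or frame shuffling. Below is a minimal PyTorch sketch of that contrastive energy-based idea. It is an illustration under stated assumptions, not the authors' implementation: the latent shape (B, T, D), the helper names, the toy linear energy head, and all hyperparameters (segment length, swap ratio, margin) are our own choices.

```python
import torch
import torch.nn.functional as F

def temporal_slice(z: torch.Tensor) -> torch.Tensor:
    """Freeze a contiguous run of frames, simulating stalled motion.

    z: latent video tensor of shape (B, T, D).
    """
    B, T, D = z.shape
    length = max(T // 4, 1)
    start = torch.randint(0, T - length + 1, (1,)).item()
    out = z.clone()
    out[:, start:start + length] = z[:, start:start + 1]
    return out

def frame_shuffle(z: torch.Tensor) -> torch.Tensor:
    """Randomly permute frames, breaking temporal order but not content."""
    return z[:, torch.randperm(z.shape[1])]

def feature_swap(z_a: torch.Tensor, z_b: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    """Swap a random subset of latent channels between two videos."""
    mask = torch.rand(z_a.shape[-1]) < ratio
    out = z_a.clone()
    out[..., mask] = z_b[..., mask]
    return out

def contrastive_energy_loss(energy_fn, z_real, z_neg, margin: float = 1.0):
    """Margin loss: real latents should score lower energy than perturbed ones."""
    return F.relu(margin + energy_fn(z_real) - energy_fn(z_neg)).mean()

# Toy usage: a linear head over mean-pooled latents stands in for the
# pretrained video generator that would supply energies in the paper.
B, T, D = 2, 16, 64
z_real = torch.randn(B, T, D)
head = torch.nn.Linear(D, 1)
energy = lambda z: head(z.mean(dim=1)).squeeze(-1)
loss = contrastive_energy_loss(energy, z_real, frame_shuffle(z_real))
loss.backward()
```

The sketch captures only the ordering constraint on energies; in the paper the energy function would be derived from a pretrained generative transformer rather than a linear head.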
Related papers
- DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model [5.78710251788825]
We propose a novel framework, named DANCER, for realistic single-person dance synthesis based on the most recent stable video diffusion model. We introduce two important modules into our framework to fully benefit from the two inputs. To further improve the generation capability of our model, we also collect a large amount of video data from the Internet and generate a novel dataset, TikTok-3K, to enhance model training.
arXiv Detail & Related papers (2025-10-31T04:42:08Z)
- Learning World Models for Interactive Video Generation [20.793871778030113]
We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors. Our work establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.
arXiv Detail & Related papers (2025-05-28T05:55:44Z)
- The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation [53.837937703425794]
LanDiff is a hybrid framework that synergizes the strengths of autoregressive language models and diffusion models. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D representations through efficient semantic compression; (2) a language model that generates semantic tokens with high-level semantic relationships; and (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos.
arXiv Detail & Related papers (2025-03-06T16:53:14Z)
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior into video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)
- AVID: Adapting Video Diffusion Models to World Models [10.757223474031248]
We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model.
AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos.
We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
arXiv Detail & Related papers (2024-10-01T13:48:31Z)
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Hybrid modeling: Applications in real-time diagnosis [64.5040763067757]
We outline a novel hybrid modeling approach that combines machine-learning-inspired models and physics-based models.
We use such models for real-time diagnosis applications.
arXiv Detail & Related papers (2020-03-04T00:44:57Z)