HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN
- URL: http://arxiv.org/abs/2008.09646v3
- Date: Sun, 06 Jul 2025 23:27:27 GMT
- Title: HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN
- Authors: Abhinav Sagar
- Abstract summary: We propose a novel deep generative network architecture designed specifically for high-resolution video synthesis. Our approach integrates key concepts from Wasserstein Generative Adversarial Networks (WGANs). Our training objective combines a pixel-wise mean squared error loss with an adversarial loss to balance frame-level accuracy and video realism.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-resolution video generation has emerged as a crucial task in computer vision, with wide-ranging applications in entertainment, simulation, and data augmentation. However, generating temporally coherent and visually realistic videos remains a significant challenge due to the high dimensionality and complex dynamics of video data. In this paper, we propose a novel deep generative network architecture designed specifically for high-resolution video synthesis. Our approach integrates key concepts from Wasserstein Generative Adversarial Networks (WGANs), enforcing a k-Lipschitz continuity constraint on the discriminator to stabilize training and enhance convergence. We further leverage Conditional GAN (cGAN) techniques by incorporating class labels during both training and inference, enabling class-specific video generation with improved semantic consistency. We provide a detailed layer-wise description of the Generator and Discriminator networks, highlighting architectural design choices promoting temporal coherence and spatial detail. The overall combined architecture, training algorithm, and optimization strategy are thoroughly presented. Our training objective combines a pixel-wise mean squared error loss with an adversarial loss to balance frame-level accuracy and video realism. We validate our approach on benchmark datasets including UCF101, Golf, and Aeroplane, encompassing diverse motion patterns and scene contexts. Quantitative evaluations using Inception Score (IS) and Fréchet Inception Distance (FID) demonstrate that our model significantly outperforms previous state-of-the-art unsupervised video generation methods in terms of both quality and diversity.
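To make the objective concrete, below is a minimal PyTorch-style sketch of one training step combining the pixel-wise MSE loss with a WGAN adversarial loss, using weight clipping to approximate the k-Lipschitz constraint on the discriminator. The signatures G(noise, labels) and D(video, labels), the latent size of 128, the clipping value, and the weighting lambda_mse are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, G, real_videos, labels, opt_d, clip_value=0.01):
    """One WGAN critic update: maximize D(real) - D(fake), then clip
    weights to approximate the k-Lipschitz continuity constraint."""
    opt_d.zero_grad()
    noise = torch.randn(real_videos.size(0), 128, device=real_videos.device)
    fake_videos = G(noise, labels).detach()
    # Wasserstein critic loss (minimized, hence the sign flip)
    loss_d = D(fake_videos, labels).mean() - D(real_videos, labels).mean()
    loss_d.backward()
    opt_d.step()
    for p in D.parameters():                 # enforce the Lipschitz bound
        p.data.clamp_(-clip_value, clip_value)
    return loss_d.item()

def generator_step(D, G, real_videos, labels, opt_g, lambda_mse=1.0):
    """One generator update: adversarial term plus pixel-wise MSE
    (assumes the class conditioning aligns fake and real clips)."""
    opt_g.zero_grad()
    noise = torch.randn(real_videos.size(0), 128, device=real_videos.device)
    fake_videos = G(noise, labels)           # (B, C, T, H, W)
    loss_adv = -D(fake_videos, labels).mean()
    loss_mse = F.mse_loss(fake_videos, real_videos)
    loss_g = loss_adv + lambda_mse * loss_mse
    loss_g.backward()
    opt_g.step()
    return loss_g.item()
```

A gradient penalty is a common alternative to weight clipping for enforcing the Lipschitz constraint and often trains more stably.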
Related papers
- VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning [21.35520258725298]
VQ-Insight is a novel reasoning-style framework for AIGC video quality assessment. It combines image quality warm-up, general task-specific temporal learning, and joint optimization with the video generation model. It consistently outperforms state-of-the-art baselines in preference comparison, multi-dimension scoring, and natural video scoring.
arXiv Detail & Related papers (2025-06-23T12:20:14Z)
- Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
- InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO [73.33751812982342]
InfLVG is an inference-time framework that enables coherent long video generation without requiring additional long-form video data. We show that InfLVG can extend video length by up to $9\times$, achieving strong consistency and semantic fidelity across scenes.
arXiv Detail & Related papers (2025-05-23T07:33:25Z)
- Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos [15.781862060265519]
CFC-VIDS-1M is a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline. We develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms (see the sketch after this entry).
arXiv Detail & Related papers (2025-02-28T18:56:35Z)
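Decoupled (factorized) spatial-temporal attention of the kind RACCOON's summary mentions is commonly implemented by attending over spatial positions within each frame, then over time at each spatial location. A minimal PyTorch sketch under that assumption; the layer sizes and names are illustrative, not RACCOON's actual code:

```python
import torch
import torch.nn as nn

class FactorizedSTAttention(nn.Module):
    """Spatial attention within each frame, then temporal attention
    across frames at each spatial location (decoupled axes)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) -- batch, frames, tokens per frame, channels
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)           # fold time into the batch axis
        s, _ = self.spatial(s, s, s)         # attend over space
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal(t, t, t)        # attend over time
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to (B, T, N, D)

# usage: tokens from 16 frames of 64 patches each, 256 channels
attn = FactorizedSTAttention(dim=256)
out = attn(torch.randn(2, 16, 64, 256))      # -> (2, 16, 64, 256)
```

Factorizing the two axes reduces attention cost from O((TN)^2) to O(TN^2 + NT^2), which is why it is a popular choice for video transformers.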
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input [6.275971782566314]
We introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer. To address the stereo video data insufficiency, we propose a Depth-based Video Generation module, DVG. We also propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training.
arXiv Detail & Related papers (2024-11-18T15:12:59Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
arXiv Detail & Related papers (2024-01-18T22:25:16Z)
- E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras or dynamic vision sensors are capable of capturing per-pixel brightness changes (called event-streams) in high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions, which take event-streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z)
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Landslide Detection and Segmentation Using Remote Sensing Images and Deep Neural Network [42.59806784981723]
Building upon findings from the 2022 Landslide4Sense Competition, we propose a deep-neural-network-based system for landslide detection and segmentation.
We use a U-Net trained with cross-entropy loss as the baseline model.
We then improve on the U-Net baseline by leveraging a wide range of deep learning techniques.
arXiv Detail & Related papers (2023-12-27T20:56:55Z)
- Network state Estimation using Raw Video Analysis: vQoS-GAN based non-intrusive Deep Learning Approach [5.8010446129208155]
vQoS-GAN can estimate network state parameters from degraded received video data.
A robust deep learning model is trained on the video data along with data rate and packet loss class labels.
The proposed semi-supervised generative adversarial network can additionally reconstruct the degraded video data to its original form for a better end-user experience.
arXiv Detail & Related papers (2022-03-22T10:42:19Z)
- Temporal Graph Network Embedding with Causal Anonymous Walks Representations [54.05212871508062]
We propose a novel approach for dynamic network representation learning based on Temporal Graph Network.
We also provide a benchmark pipeline for the evaluation of temporal network embeddings.
We show the applicability and superior performance of our model in the real-world downstream graph machine learning task provided by one of the top European banks.
arXiv Detail & Related papers (2021-08-19T15:39:52Z)
- Image Restoration by Deep Projected GSURE [115.57142046076164]
Ill-posed inverse problems appear in many image processing applications, such as deblurring and super-resolution.
We propose a new image restoration framework based on minimizing a loss function that includes a "projected version" of the Generalized Stein Unbiased Risk Estimator (GSURE) and a parameterization of the latent image by a CNN. The classical SURE identity underlying this estimator is recalled after this entry.
arXiv Detail & Related papers (2021-02-04T08:52:46Z)
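As background (standard material, not taken from that paper): for $y = x + n$ with $n \sim \mathcal{N}(0, \sigma^2 I_d)$ and a weakly differentiable estimator $f$, Stein's identity gives an unbiased estimate of the reconstruction MSE that requires only $y$, never the clean image $x$. GSURE extends this to $y = Hx + n$, and the projected variant measures the error of the component of $x$ in the range of $H^\top$.

```latex
% Stein's Unbiased Risk Estimator (SURE): an unbiased surrogate for the
% MSE of an estimator f applied to y = x + n,  n ~ N(0, sigma^2 I_d).
\mathbb{E}\left[\lVert f(y) - x \rVert^2\right]
  = \mathbb{E}\left[\lVert f(y) - y \rVert^2\right]
    - d\,\sigma^2
    + 2\sigma^2\,\mathbb{E}\left[\operatorname{div}_y f(y)\right]
```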
- An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate temporal information across frames and reduces the need for high-complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z)
- Monocular Depth Estimation Using Multi Scale Neural Network And Feature Fusion [0.0]
Our network uses two different blocks: the first uses different filter sizes for convolution and merges all the individual feature maps.
The second block uses dilated convolutions in place of fully connected layers thus reducing computations and increasing the receptive field.
We train and test our network on the Make3D, NYU Depth V2, and KITTI datasets using standard depth-estimation metrics, namely the RMSE and SILog losses (a sketch of the SILog loss follows this entry).
arXiv Detail & Related papers (2020-09-11T18:08:52Z)
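The scale-invariant logarithmic (SILog) loss mentioned above is a standard depth-estimation objective (introduced by Eigen et al.); a minimal PyTorch-style sketch, with the weighting lam chosen illustratively rather than taken from that paper:

```python
import torch

def silog_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss for depth estimation:
    mean squared log-difference minus lam * (mean log-difference)^2."""
    d = torch.log(pred + eps) - torch.log(target + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2

# usage: predicted and ground-truth depth maps, strictly positive
loss = silog_loss(torch.rand(4, 1, 64, 64) + 0.1,
                  torch.rand(4, 1, 64, 64) + 0.1)
```

The second term discounts a global scale shift between prediction and ground truth, which is why the loss is called scale-invariant.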
- Medical Image Segmentation Using a U-Net type of Architecture [0.0]
We argue that the U-Net architecture, when combined with a supervised training strategy at the bottleneck layer, can produce results comparable to those of the original U-Net architecture.
We introduce a fully supervised, FC-layer-based pixel-wise loss at the bottleneck of the encoder branch of U-Net.
The two-layer FC sub-net trains the bottleneck representation to contain more semantic information, which is used by the decoder layers to predict the final segmentation map.
arXiv Detail & Related papers (2020-05-11T16:10:18Z)
- Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [101.84651388520584]
This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs.
Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T15:51:00Z)
- Learning the Loss Functions in a Discriminative Space for Video Restoration [48.104095018697556]
We propose a new framework for building effective loss functions by learning a discriminative space specific to a video restoration task.
Our framework is similar to GANs in that we iteratively train two networks - a generator and a loss network.
Experiments on video super-resolution and deblurring show that our method generates visually more pleasing videos.
arXiv Detail & Related papers (2020-03-20T06:58:27Z)
- A U-Net Based Discriminator for Generative Adversarial Networks [86.67102929147592]
We propose an alternative U-Net-based discriminator architecture for generative adversarial networks (GANs).
The proposed architecture allows detailed per-pixel feedback to be provided to the generator while maintaining the global coherence of synthesized images (a minimal sketch follows this entry).
The novel discriminator improves over the state of the art in terms of standard distribution and image quality metrics.
arXiv Detail & Related papers (2020-02-28T11:16:54Z)
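A U-Net-style discriminator of the kind described in the last entry can return both a global real/fake score (from the encoder bottleneck) and a per-pixel decision map (from the decoder). A minimal PyTorch sketch under that assumption; the channel widths and depth are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class UNetDiscriminator(nn.Module):
    """Encoder yields a global realism score; the decoder (with a skip
    connection) yields per-pixel feedback, U-Net GAN style."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2))
        self.global_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                         nn.Linear(base * 2, 1))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1),
                                  nn.LeakyReLU(0.2))
        self.pixel_head = nn.ConvTranspose2d(base * 2, 1, 4, 2, 1)

    def forward(self, x):
        e1 = self.enc1(x)                     # (B, base, H/2, W/2)
        e2 = self.enc2(e1)                    # (B, 2*base, H/4, W/4)
        score = self.global_head(e2)          # (B, 1) global real/fake logit
        d1 = self.dec1(e2)                    # (B, base, H/2, W/2)
        pixel = self.pixel_head(torch.cat([d1, e1], dim=1))  # (B, 1, H, W)
        return score, pixel

# usage: both outputs can feed a GAN loss (global plus per-pixel terms)
D = UNetDiscriminator()
score, pixel_map = D(torch.randn(2, 3, 64, 64))  # -> (2, 1), (2, 1, 64, 64)
```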