Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction
- URL: http://arxiv.org/abs/2412.04929v2
- Date: Mon, 09 Dec 2024 02:54:53 GMT
- Title: Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction
- Authors: Gaurav Shrivastava, Abhinav Shrivastava
- Abstract summary: We introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames.
We establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101.
- Score: 43.16308241800144
- Abstract: Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-to-image translation, and image-to-image conversion. However, their capability falls short in the realm of video prediction, mainly because they treat videos as collections of independent images, relying on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper, we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. We also report a 75% reduction in the sampling steps required to generate a new frame, making our framework more efficient at inference time. Through extensive experimentation, we establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project page https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results.
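Below is a minimal, hypothetical PyTorch-style sketch of the general idea of sampling the next frame as a continuous process that starts from the previous frame rather than from pure Gaussian noise. The network name `drift_net` and the Euler-Maruyama integration scheme are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch (not the authors' code): sample the next frame by
# integrating a continuous process that starts at the previous frame instead
# of at pure Gaussian noise. `drift_net` is an assumed pretrained network that
# predicts the motion of the current state toward the next frame.
import torch

@torch.no_grad()
def predict_next_frame(drift_net, prev_frame, num_steps=25, sigma=0.05):
    """Euler-Maruyama integration from the previous frame toward the next.

    Starting from a real frame (rather than noise) is what allows far fewer
    sampling steps per frame than standard frame-wise diffusion; the paper
    reports roughly 75% fewer steps.
    """
    x = prev_frame.clone()                       # (B, C, H, W)
    batch = x.shape[0]
    ts = torch.linspace(0.0, 1.0, num_steps + 1, device=x.device)
    for i in range(num_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        t_batch = torch.full((batch,), float(t), device=x.device)
        drift = drift_net(x, t_batch)            # predicted drift toward the next frame
        noise = torch.randn_like(x)              # small stochastic perturbation
        x = x + drift * dt + sigma * dt.sqrt() * noise
    return x
```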
Related papers
- SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models [10.66567645920237]
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the garment while maintaining temporal consistency.
We reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions.
Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence (a generic sketch of such a temporal attention layer appears after this list).
arXiv Detail & Related papers (2024-12-13T14:50:26Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - FusionFrames: Efficient Architectural Aspects for Text-to-Video Generation Pipeline [4.295130967329365]
This paper presents a new two-stage latent diffusion text-to-video generation architecture based on a text-to-image diffusion model.
The design of our model significantly reduces computational costs compared to other masked frame approaches.
We evaluate different configurations of a MoVQ-based video decoding scheme to improve consistency and achieve better PSNR, SSIM, MSE, and LPIPS scores.
arXiv Detail & Related papers (2023-11-22T00:26:15Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss (a sketch of one common form appears after this list) and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z) - Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
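As referenced in the SwiftTry entry above, several of these works inject temporal attention layers into image diffusion backbones to enforce coherence across frames. The following is an illustrative, generic sketch of such a layer; the class name, tensor shapes, and residual wiring are assumptions, not any specific paper's API.

```python
# Illustrative sketch of a temporal attention layer: self-attention is applied
# across the time axis, independently for each spatial location, so an image
# backbone can exchange information between frames.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the frame dimension of a (B, T, C, H, W) tensor."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        # `channels` must be divisible by `num_heads`.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention mixes only time.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + out  # residual connection keeps the image backbone intact
```

A layer like this is typically interleaved with the existing spatial layers of a pretrained image diffusion model, which is the general recipe the try-on and text-to-video entries above rely on.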
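The per-frame video segmentation entry above mentions a temporal consistency loss. Below is a hypothetical sketch of one common form, which aligns the previous frame's prediction with optical flow before comparing per-pixel class distributions; the warping and the KL formulation are assumptions, not necessarily that paper's exact loss.

```python
# Illustrative sketch of a temporal consistency loss: per-frame segmentation
# predictions are encouraged to agree across consecutive frames.
import torch
import torch.nn.functional as F

def temporal_consistency_loss(logits_t, logits_prev, flow=None):
    """Penalize disagreement between predictions of two consecutive frames.

    logits_t, logits_prev: (B, num_classes, H, W) per-frame network outputs.
    flow: optional (B, 2, H, W) optical flow used to warp the previous frame's
          prediction into the current frame before comparing.
    """
    if flow is not None:
        b, _, h, w = flow.shape
        # Build a sampling grid that follows the flow, normalized to [-1, 1].
        ys, xs = torch.meshgrid(
            torch.arange(h, device=flow.device, dtype=flow.dtype),
            torch.arange(w, device=flow.device, dtype=flow.dtype),
            indexing="ij",
        )
        grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
        grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
        grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
        logits_prev = F.grid_sample(logits_prev, grid, align_corners=True)
    # KL divergence between the two per-pixel class distributions.
    log_p_t = F.log_softmax(logits_t, dim=1)
    p_prev = F.softmax(logits_prev, dim=1)
    return F.kl_div(log_p_t, p_prev, reduction="batchmean")
```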
This list is automatically generated from the titles and abstracts of the papers on this site.