Unified Dense Prediction of Video Diffusion
- URL: http://arxiv.org/abs/2503.09344v1
- Date: Wed, 12 Mar 2025 12:41:02 GMT
- Title: Unified Dense Prediction of Video Diffusion
- Authors: Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang
- Abstract summary: We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize a colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation.
- Score: 91.16237431830417
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We utilize a colormap to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves video generation's consistency and motion smoothness without increasing computational costs. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose a large-scale dense prediction video dataset, \datasetname, addressing the issue that existing datasets do not concurrently contain captions, videos, segmentation, or depth maps. Comprehensive experiments demonstrate the high efficiency of our method, surpassing the state-of-the-art in terms of video quality, consistency, and motion smoothness.
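The abstract hinges on two mechanisms: rendering dense predictions (entity masks, depth maps) as RGB colormaps so a single video generator can produce them alongside RGB frames, and learnable task embeddings that switch the model between dense prediction tasks. Below is a minimal PyTorch-style sketch of both ideas; the names `depth_to_colormap` and `TaskEmbedding`, the choice of the viridis colormap, and the embedding dimension are illustrative assumptions, not details taken from the paper or its code release.

```python
# Illustrative sketch (not the authors' implementation).
import torch
import torch.nn as nn
import matplotlib.cm as cm


def depth_to_colormap(depth: torch.Tensor) -> torch.Tensor:
    """Map a (H, W) depth tensor to a (3, H, W) RGB image via a colormap,
    so depth can be generated by the same RGB video backbone."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    rgb = cm.viridis(d.cpu().numpy())[..., :3]        # (H, W, 3) in [0, 1]
    return torch.from_numpy(rgb).permute(2, 0, 1).float()


class TaskEmbedding(nn.Module):
    """Learnable per-task embeddings (e.g. segmentation vs. depth) added to
    the diffusion conditioning, so one model covers multiple dense tasks."""
    def __init__(self, num_tasks: int = 2, dim: int = 768):
        super().__init__()
        self.emb = nn.Embedding(num_tasks, dim)

    def forward(self, task_id: torch.Tensor) -> torch.Tensor:
        return self.emb(task_id)                       # (B, dim)


# Usage: combine a (placeholder) text embedding with the chosen task embedding
# before it conditions the denoising network.
text_cond = torch.randn(1, 768)                        # stand-in for a text encoder output
task_cond = TaskEmbedding()(torch.tensor([1]))         # e.g. task 1 = depth, task 0 = segmentation
cond = text_cond + task_cond
```

The key design point suggested by the abstract is that no extra decoder heads are needed: dense outputs share the RGB space of the generator, and only the conditioning signal changes per task.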
Related papers
- VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion [9.465414294387507]
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption.
We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models.
We introduce a large-scale video referring matting dataset with 10,000 videos.
arXiv Detail & Related papers (2025-03-11T06:12:35Z)
- VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control [47.34885131252508]
Video inpainting aims to restore corrupted video content.
We propose a novel dual-stream paradigm, VideoPainter, to process masked videos.
We also introduce a novel target region ID resampling technique that enables any-length video inpainting.
arXiv Detail & Related papers (2025-03-07T17:59:46Z)
- SpecDM: Hyperspectral Dataset Synthesis with Pixel-level Semantic Annotations [27.391859339238906]
In this paper, we explore the potential of generative diffusion models in synthesizing hyperspectral images with pixel-level annotations.
To the best of our knowledge, this is the first work to generate high-dimensional HSIs with annotations.
We select two of the most widely used dense prediction tasks, semantic segmentation and change detection, and generate datasets suitable for these tasks.
arXiv Detail & Related papers (2025-02-24T11:13:37Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)
- Coherent Loss: A Generic Framework for Stable Video Segmentation [103.78087255807482]
We investigate how a jittering artifact degrades the visual quality of video segmentation results.
We propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts.
arXiv Detail & Related papers (2020-10-25T10:48:28Z)