CoDeF: Content Deformation Fields for Temporally Consistent Video
Processing
- URL: http://arxiv.org/abs/2308.07926v1
- Date: Tue, 15 Aug 2023 17:59:56 GMT
- Title: CoDeF: Content Deformation Fields for Temporally Consistent Video
Processing
- Authors: Hao Ouyang, Qiuyu Wang, Yuxi Xiao, Qingyan Bai, Juntao Zhang, Kecheng
Zheng, Xiaowei Zhou, Qifeng Chen, Yujun Shen
- Abstract summary: CoDeF is a new type of video representation, which consists of a canonical content field and a temporal deformation field.
We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.
- Score: 89.49585127724941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the content deformation field CoDeF as a new type of video
representation, which consists of a canonical content field aggregating the
static contents in the entire video and a temporal deformation field recording
the transformations from the canonical image (i.e., rendered from the canonical
content field) to each individual frame along the time axis.Given a target
video, these two fields are jointly optimized to reconstruct it through a
carefully tailored rendering pipeline.We advisedly introduce some
regularizations into the optimization process, urging the canonical content
field to inherit semantics (e.g., the object shape) from the video.With such a
design, CoDeF naturally supports lifting image algorithms for video processing,
in the sense that one can apply an image algorithm to the canonical image and
effortlessly propagate the outcomes to the entire video with the aid of the
temporal deformation field.We experimentally show that CoDeF is able to lift
image-to-image translation to video-to-video translation and lift keypoint
detection to keypoint tracking without any training.More importantly, thanks to
our lifting strategy that deploys the algorithms on only one image, we achieve
superior cross-frame consistency in processed videos compared to existing
video-to-video translation approaches, and even manage to track non-rigid
objects like water and smog.Project page can be found at
https://qiuyu96.github.io/CoDeF/.
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF)
Such a pipeline enjoys three appealing advantages.
arXiv Detail & Related papers (2023-12-07T18:59:41Z) - Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named textbfRecurrent vtextbfIdeo textbfGAN textbfInversion and etextbfDiting (RIGID)
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z) - Bringing Image Scene Structure to Video via Frame-Clip Consistency of
Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images only available during training can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z) - SwinBERT: End-to-End Transformers with Sparse Attention for Video
Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.