Intrinsic Temporal Regularization for High-resolution Human Video
Synthesis
- URL: http://arxiv.org/abs/2012.06134v1
- Date: Fri, 11 Dec 2020 05:29:45 GMT
- Title: Intrinsic Temporal Regularization for High-resolution Human Video
Synthesis
- Authors: Lingbo Yang, Zhanning Gao, Peiran Ren, Siwei Ma, Wen Gao
- Abstract summary: Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
- Score: 59.54483950973432
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Temporal consistency is crucial for extending image processing pipelines to
the video domain, which is often enforced with flow-based warping error over
adjacent frames. Yet for human video synthesis, such a scheme is less reliable
due to the misalignment between source and target video as well as the
difficulty in accurate flow estimation. In this paper, we propose an effective
intrinsic temporal regularization scheme to mitigate these issues, where an
intrinsic confidence map is estimated via the frame generator to regulate
motion estimation via temporal loss modulation. This creates a shortcut for
back-propagating temporal loss gradients directly to the front-end motion
estimator, thus improving training stability and temporal coherence in output
videos. We apply our intrinsic temporal regularization to a single-image generator,
leading to a powerful "INTERnet" capable of generating $512\times512$
resolution human action videos with temporal-coherent, realistic visual
details. Extensive experiments demonstrate the superiority of the proposed INTERnet
over several competitive baselines.
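To make the temporal loss modulation concrete, below is a minimal PyTorch-style sketch of a confidence-modulated flow-warping loss, assuming the mechanism described in the abstract; it is not the authors' implementation, and the tensor names (`frame_t`, `frame_tm1`, `flow`, `confidence`) are hypothetical placeholders for the generated frames, the front-end motion estimate, and the intrinsic confidence map.

```python
# Minimal sketch (not the authors' code) of a confidence-modulated
# flow-warping temporal loss. Assumed, hypothetical tensors:
#   frame_t, frame_tm1 : generated frames at times t and t-1, shape (B, 3, H, W)
#   flow               : estimated flow from frame t to frame t-1, shape (B, 2, H, W)
#   confidence         : intrinsic confidence map in [0, 1], shape (B, 1, H, W)
import torch
import torch.nn.functional as F


def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `frame` with the given optical flow using grid_sample."""
    _, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow     # (B, 2, H, W)
    # Normalize to [-1, 1] as expected by grid_sample (x first, then y).
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)


def intrinsic_temporal_loss(frame_t, frame_tm1, flow, confidence):
    """L1 warping error between frame_t and the warped previous frame,
    down-weighted per pixel by the confidence map so that unreliable
    regions (occlusions, misalignment) are penalized less."""
    warped_prev = warp(frame_tm1, flow)
    residual = (frame_t - warped_prev).abs()
    return (confidence * residual).mean()
```

Because the confidence map multiplies the warping residual inside the loss, temporal gradients reach both the frame generator and the front-end motion estimator, matching the shortcut described above; pixels with low confidence contribute less, so misaligned or poorly estimated regions do not dominate training.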
Related papers
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition [68.6707284662443]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise.
One critical aspect is formulating a consistency constraint specifically for the temporal-spatial illumination and appearance of enhanced versions.
We present an innovative video Retinex-based decomposition strategy that operates without the need for explicit supervision.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - STint: Self-supervised Temporal Interpolation for Geospatial Data [0.0]
Supervised and unsupervised techniques have demonstrated the potential for temporal interpolation of video data.
Most prevailing temporal interpolation techniques hinge on optical flow, which encodes the motion of pixels between video frames.
In this work, we propose an unsupervised temporal interpolation technique, which does not rely on ground truth data or require any motion information like optical flow.
arXiv Detail & Related papers (2023-08-31T18:04:50Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - Continuous Space-Time Video Super-Resolution Utilizing Long-Range
Temporal Information [48.20843501171717]
We propose a continuous ST-VSR (CSTVSR) method that can convert the given video to any frame rate and spatial resolution.
We show that the proposed algorithm has good flexibility and achieves better performance on various datasets.
arXiv Detail & Related papers (2023-02-26T08:02:39Z) - Distortion-Aware Network Pruning and Feature Reuse for Real-time Video
Segmentation [49.17930380106643]
We propose a novel framework to speed up any architecture with skip-connections for real-time vision tasks.
Specifically, at the arrival of each frame, we transform the features from the previous frame to reuse them at specific spatial bins.
We then perform partial computation of the backbone network on the regions of the current frame that captures temporal differences between the current and previous frame.
arXiv Detail & Related papers (2022-06-20T07:20:02Z) - Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework to jointly utilize local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z) - Learning Temporally and Semantically Consistent Unpaired Video-to-video
Translation Through Pseudo-Supervision From Synthetic Optical Flow [5.184108122340348]
Unpaired video-to-video translation aims to translate videos between a source and a target domain without the need of paired training data, making it more feasible for real applications.
We propose a paradigm that regularizes video consistency by synthesizing novel motions in input videos with the generated optical flow instead of estimating them.
arXiv Detail & Related papers (2022-01-15T01:10:34Z) - Consistency Guided Scene Flow Estimation [159.24395181068218]
CGSF is a self-supervised framework for the joint reconstruction of 3D scene structure and motion from stereo video.
We show that the proposed model can reliably predict disparity and scene flow in challenging imagery.
It achieves better generalization than the state-of-the-art, and adapts quickly and robustly to unseen domains.
arXiv Detail & Related papers (2020-06-19T17:28:07Z)