Learning Temporally and Semantically Consistent Unpaired Video-to-video
Translation Through Pseudo-Supervision From Synthetic Optical Flow
- URL: http://arxiv.org/abs/2201.05723v1
- Date: Sat, 15 Jan 2022 01:10:34 GMT
- Title: Learning Temporally and Semantically Consistent Unpaired Video-to-video
Translation Through Pseudo-Supervision From Synthetic Optical Flow
- Authors: Kaihong Wang, Kumar Akash, Teruhisa Misu
- Abstract summary: Unpaired video-to-video translation aims to translate videos between a source and a target domain without the need for paired training data, making it more feasible for real applications.
We propose a paradigm that regularizes video consistency by synthesizing novel motions in input videos with the generated optical flow instead of estimating them.
- Score: 5.184108122340348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unpaired video-to-video translation aims to translate videos between a source
and a target domain without the need of paired training data, making it more
feasible for real applications. Unfortunately, the translated videos generally
suffer from temporal and semantic inconsistency. To address this, many existing
works adopt spatiotemporal consistency constraints incorporating temporal
information based on motion estimation. However, the inaccuracies in the
estimation of motion deteriorate the quality of the guidance towards
spatiotemporal consistency, which leads to unstable translation. In this work,
we propose a novel paradigm that regularizes the spatiotemporal consistency by
synthesizing motions in input videos with the generated optical flow instead of
estimating them. Therefore, the synthetic motion can be applied in the
regularization paradigm to keep motions consistent across domains without the
risk of errors in motion estimation. Thereafter, we utilize our unsupervised
recycle and unsupervised spatial loss, guided by the pseudo-supervision
provided by the synthetic optical flow, to accurately enforce spatiotemporal
consistency in both domains. Experiments show that our method is versatile in
various scenarios and achieves state-of-the-art performance in generating
temporally and semantically consistent videos. Code is available at:
https://github.com/wangkaihong/Unsup_Recycle_GAN/.
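To make the abstract's mechanism concrete, the sketch below shows one way pseudo-supervision from synthetic optical flow could be wired up in PyTorch: a smooth random flow field defines a known motion, the input frame is warped with it to create a pseudo next frame, and consistency losses compare warping-then-translating against translating-then-warping. The helper names (`random_flow`, `warp`), the generator interfaces `G_XY`/`G_YX`, and the specific L1 loss forms are illustrative assumptions based on the abstract, not the authors' implementation; the linked repository contains the actual code.
```python
# Minimal sketch of pseudo-supervision from synthetic optical flow (PyTorch).
# Helper names, generator interfaces, and loss forms are illustrative choices,
# not taken from the authors' released code.
import torch
import torch.nn.functional as F

def random_flow(b, h, w, max_disp=8.0, grid_size=8, device="cpu"):
    # Smooth synthetic flow: coarse random displacements upsampled to full size.
    coarse = (torch.rand(b, 2, grid_size, grid_size, device=device) * 2 - 1) * max_disp
    return F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=True)

def warp(img, flow):
    # Backward-warp `img` (B,C,H,W) by `flow` (B,2,H,W) in pixels via grid_sample.
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    x = xs.unsqueeze(0) + flow[:, 0]
    y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize sampling coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack((2 * x / (w - 1) - 1, 2 * y / (h - 1) - 1), dim=-1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

def pseudo_supervision_losses(G_XY, G_YX, x_t):
    # Synthesize motion with exactly known flow, so the consistency targets
    # carry no flow-estimation error. The two losses below are a plausible
    # reading of the abstract, not the paper's exact equations.
    b, _, h, w = x_t.shape
    flow = random_flow(b, h, w, device=x_t.device)
    x_warp = warp(x_t, flow)          # pseudo "next frame" in domain X
    y_t = G_XY(x_t)                   # translation of the original frame
    # Unsupervised spatial-style loss: translating the warped frame should agree
    # with warping the translated frame (same motion in both domains).
    spatial = F.l1_loss(G_XY(x_warp), warp(y_t, flow))
    # Unsupervised recycle-style loss: mapping the warped translation back to
    # the source domain should recover the warped input.
    recycle = F.l1_loss(G_YX(warp(y_t, flow)), x_warp)
    return recycle, spatial
```
Because the flow is synthesized rather than estimated, the warped targets are exact by construction, which is what lets the consistency constraint avoid the errors of motion estimation described in the abstract.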
Related papers
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video compression method can obtain more accurate inter prediction contexts.
arXiv Detail & Related papers (2024-01-29T03:30:21Z)
- Segmenting the motion components of a video: A long-term unsupervised model [5.801044612920816]
We want to provide a coherent and stable motion segmentation over the video sequence.
We propose a novel long-term spatio-temporal model operating in a totally unsupervised way.
We report experiments on four VOS benchmarks, demonstrating competitive quantitative results.
arXiv Detail & Related papers (2023-10-02T09:33:54Z)
- STint: Self-supervised Temporal Interpolation for Geospatial Data [0.0]
Supervised and unsupervised techniques have demonstrated the potential for temporal interpolation of video data.
Most prevailing temporal interpolation techniques hinge on optical flow, which encodes the motion of pixels between video frames.
In this work, we propose an unsupervised temporal interpolation technique, which does not rely on ground truth data or require any motion information like optical flow.
arXiv Detail & Related papers (2023-08-31T18:04:50Z)
- Unsupervised Learning Optical Flow in Multi-frame Dynamic Environment Using Temporal Dynamic Modeling [7.111443975103329]
In this paper, we explore the optical flow estimation from multiple-frame sequences of dynamic scenes.
We use motion priors of the adjacent frames to provide more reliable supervision of the occluded regions.
Experiments on KITTI 2012, KITTI 2015, Sintel Clean, and Sintel Final datasets demonstrate the effectiveness of our methods.
arXiv Detail & Related papers (2023-04-14T14:32:02Z)
- MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This work introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
- Dynamic View Synthesis from Dynamic Monocular Video [69.80425724448344]
We present an algorithm for generating views at arbitrary viewpoints and any input time step given a monocular video of a dynamic scene.
We show extensive quantitative and qualitative results of dynamic view synthesis from casually captured videos.
arXiv Detail & Related papers (2021-05-13T17:59:50Z)
- Long-Term Temporally Consistent Unpaired Video Translation from Simulated Surgical 3D Data [0.059110875077162096]
We propose a novel approach which combines unpaired image translation with neural rendering to transfer simulated to photorealistic surgical abdominal scenes.
By introducing global learnable textures and a lighting-invariant view-consistency loss, our method produces consistent translations of arbitrary views.
By extending existing image-based methods to view-consistent videos, we aim to impact the applicability of simulated training and evaluation environments for surgical applications.
arXiv Detail & Related papers (2021-03-31T16:31:26Z)
- Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation (a generic confidence-weighted variant is sketched after this entry).
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z)
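For the intrinsic temporal regularization entry above, the following is a minimal sketch of a generic confidence-weighted warping loss, assuming the generator exposes a confidence head with values in [0, 1]; the interface, the weighting rule, and the regularization constant are illustrative assumptions, not the INTERnet design.
```python
# Generic confidence-weighted temporal loss (PyTorch), sketched under the
# assumptions stated above; not taken from the INTERnet paper or code.
import torch

def confidence_weighted_temporal_loss(frame_t, warped_prev, confidence, reg=0.01):
    # frame_t, warped_prev: (B, C, H, W); warped_prev is frame t-1 aligned to t
    # using a (possibly noisy) estimated flow. confidence: (B, 1, H, W) in [0, 1],
    # e.g. a sigmoid head of the frame generator.
    photometric = (frame_t - warped_prev).abs().mean(dim=1, keepdim=True)
    # Down-weight regions where motion estimation is unreliable; the -log term
    # keeps the confidence map from collapsing to zero everywhere.
    return (confidence * photometric).mean() + reg * (-torch.log(confidence + 1e-6)).mean()
```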
This list is automatically generated from the titles and abstracts of the papers on this site.