C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal
Consistent Motion Transfer
- URL: http://arxiv.org/abs/2012.08976v1
- Date: Wed, 16 Dec 2020 14:11:13 GMT
- Title: C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal
Consistent Motion Transfer
- Authors: Dongxu Wei, Xiaowei Xu, Haibin Shen, Kejie Huang
- Abstract summary: We propose Coarse-to-Fine Flow Warping Network (C2F-FWN) for spatial-temporal consistent HVMT.
C2F-FWN employs Flow Temporal Consistency (FTC) Loss to enhance temporal consistency.
Our approach outperforms state-of-the-art HVMT methods in terms of both spatial and temporal consistency.
- Score: 5.220611885921671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human video motion transfer (HVMT) aims to synthesize videos in which one person
imitates the actions of another. Although existing GAN-based HVMT methods have
achieved great success, they either fail to preserve appearance details due to
the loss of spatial consistency between synthesized and exemplary images, or
generate incoherent video results due to the lack of temporal consistency among
video frames. In this paper, we propose Coarse-to-Fine Flow Warping Network
(C2F-FWN) for spatial-temporal consistent HVMT. Particularly, C2F-FWN utilizes
coarse-to-fine flow warping and Layout-Constrained Deformable Convolution
(LC-DConv) to improve spatial consistency, and employs Flow Temporal
Consistency (FTC) Loss to enhance temporal consistency. In addition, provided
with multi-source appearance inputs, C2F-FWN can support appearance attribute
editing with great flexibility and efficiency. Besides public datasets, we also
collected a large-scale HVMT dataset named SoloDance for evaluation. Extensive
experiments conducted on our SoloDance dataset and the iPER dataset show that
our approach outperforms state-of-the-art HVMT methods in terms of both spatial and
temporal consistency. Source code and the SoloDance dataset are available at
https://github.com/wswdx/C2F-FWN.
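The abstract names flow warping and the Flow Temporal Consistency (FTC) Loss but does not spell out their formulas, so the following is a minimal, hypothetical PyTorch-style sketch of the general pattern they build on: warping one frame toward another with a dense flow field, and penalizing the temporal discrepancy between consecutive frames. The function names, the L1 penalty, and the flow convention (per-pixel displacements in pixels) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow):
    """Warp an image/feature map x (N, C, H, W) with a dense flow field
    flow (N, 2, H, W) holding per-pixel (dx, dy) displacements in pixels."""
    n, _, h, w = x.shape
    # Build a normalized sampling grid in [-1, 1] as required by grid_sample.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow[:, 0]) / max(w - 1, 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / max(h - 1, 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2), (x, y) order
    return F.grid_sample(x, grid, align_corners=True)

def temporal_consistency_loss(frame_t, frame_prev, flow_prev_to_t):
    """Generic temporal term: warp the previous frame forward with the
    estimated flow and penalize its difference from the current frame.
    The paper's FTC Loss is defined on flows; this is only an assumed analogue."""
    warped_prev = flow_warp(frame_prev, flow_prev_to_t)
    return F.l1_loss(frame_t, warped_prev)
```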
Related papers
- Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural-temporal alignment learning method.
It consistently improves 13 existing strong video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
- Collaborative Feedback Discriminative Propagation for Video Super-Resolution [66.61201445650323]
The success of video super-resolution (VSR) methods stems mainly from exploiting spatial and temporal information.
Inaccurate alignment usually leads to aligned features with significant artifacts.
Existing propagation modules only propagate features of the same timestep forward or backward.
arXiv Detail & Related papers (2024-04-06T22:08:20Z)
- FLAIR: A Conditional Diffusion Framework with Applications to Face Video Restoration [14.17192434286707]
We present a new conditional diffusion framework called FLAIR for face video restoration.
FLAIR ensures temporal consistency across frames in a computationally efficient fashion.
Our experiments show the superiority of FLAIR over the current state-of-the-art (SOTA) for video super-resolution, deblurring, JPEG restoration, and space-time frame interpolation on two high-quality face video datasets.
arXiv Detail & Related papers (2023-11-26T22:09:18Z)
- Spatial-Temporal Transformer based Video Compression Framework [44.723459144708286]
We propose a novel Spatial-Temporal Transformer based Video Compression (STT-VC) framework.
It contains a Relaxed Deformable Transformer (RDT) with Uformer based offsets estimation for motion estimation and compensation, a Multi-Granularity Prediction (MGP) module based on multi-reference frames for prediction refinement, and a Spatial Feature Distribution prior based Transformer (SFD-T) for efficient temporal-spatial joint residual compression.
Experimental results demonstrate that our method achieves the best result with 13.5% BD-Rate saving over VTM.
arXiv Detail & Related papers (2023-09-21T09:23:13Z)
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing margins.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
The proposed method achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z)
- Conditional Image-to-Video Generation with Latent Flow Diffusion Models [18.13991670747915]
Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image and a condition.
We propose an approach for cI2V using novel latent flow diffusion models (LFDM)
LFDM synthesizes an optical flow sequence in the latent space based on the given condition to warp the given image.
arXiv Detail & Related papers (2023-03-24T01:54:26Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
Our proposed FGST outperforms state-of-the-art methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution (see the sketch after this list).
Our model surpasses the previous state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
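The depthwise separable convolution that SepConvLSTM substitutes into the ConvLSTM gates is a standard building block; a minimal PyTorch sketch under that assumption follows. The class name and the gate comment are illustrative; the actual gate wiring and hyperparameters are defined in the paper.

```python
import torch.nn as nn

class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) spatial
    convolution followed by a 1x1 (pointwise) convolution, the substitution
    SepConvLSTM applies in place of each dense ConvLSTM gate convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            padding=padding, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=True)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Illustrative usage: each ConvLSTM gate (input, forget, output, cell) would
# apply such a layer to the concatenated [input, hidden] tensor instead of a
# dense Conv2d, reducing parameters and computation.
```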