JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion
Retargeting
- URL: http://arxiv.org/abs/2106.09679v1
- Date: Thu, 17 Jun 2021 17:32:32 GMT
- Title: JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion
Retargeting
- Authors: Ron Mokady, Rotem Tzaban, Sagie Benaim, Amit H. Bermano and Daniel
Cohen-Or
- Abstract summary: unsupervised motion in videos has seen substantial advancements through the use of deep neural networks.
We introduce JOKR - a JOint Keypoint Representation that captures the motion common to both the source and target videos, without requiring any object prior or data collection.
We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans.
- Score: 53.28477676794658
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The task of unsupervised motion retargeting in videos has seen substantial
advancements through the use of deep neural networks. While early works
concentrated on specific object priors such as a human face or body, recent
work considered the unsupervised case. When the source and target videos,
however, are of different shapes, current methods fail. To alleviate this
problem, we introduce JOKR - a JOint Keypoint Representation that captures the
motion common to both the source and target videos, without requiring any
object prior or data collection. By employing a domain confusion term, we
enforce the unsupervised keypoint representations of both videos to be
indistinguishable. This encourages disentanglement between the parts of the
motion that are common to the two domains, and their distinctive appearance and
motion, enabling the generation of videos that capture the motion of the one
while depicting the style of the other. To enable cases where the objects are
of different proportions or orientations, we apply a learned affine
transformation between the JOKRs. This augments the representation to be affine
invariant, and in practice broadens the variety of possible retargeting pairs.
This geometry-driven representation enables further intuitive control, such as
temporal coherence and manual editing. Through comprehensive experimentation,
we demonstrate the applicability of our method to different challenging
cross-domain video pairs. We evaluate our method both qualitatively and
quantitatively, and demonstrate that our method handles various cross-domain
scenarios, such as different animals, different flowers, and humans. We also
demonstrate superior temporal coherency and visual quality compared to
state-of-the-art alternatives, through statistical metrics and a user study.
Source code and videos can be found at https://rmokady.github.io/JOKR/ .
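The abstract describes two mechanisms: a domain confusion term that makes the unsupervised keypoint representations of the two videos indistinguishable, and a learned affine transformation between the JOKRs that makes the shared representation affine invariant. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' released implementation; the module and function names (KeypointDomainDiscriminator, LearnedAffine, domain_confusion_loss) and the exact loss formulation are illustrative assumptions.

```python
# Hypothetical sketch of a domain-confusion term over keypoints plus a learned
# affine map between the two keypoint representations (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeypointDomainDiscriminator(nn.Module):
    """Predicts whether a set of 2D keypoints came from the source or the target video."""

    def __init__(self, num_keypoints: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_keypoints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, kp: torch.Tensor) -> torch.Tensor:
        # kp: (batch, num_keypoints, 2) -> one logit per sample
        return self.net(kp.flatten(1))


class LearnedAffine(nn.Module):
    """A learned 2D affine map applied to the keypoints of one domain,
    compensating for differences in proportion and orientation."""

    def __init__(self):
        super().__init__()
        self.A = nn.Parameter(torch.eye(2))    # scale / rotation / shear
        self.b = nn.Parameter(torch.zeros(2))  # translation

    def forward(self, kp: torch.Tensor) -> torch.Tensor:
        return kp @ self.A.T + self.b


def discriminator_loss(disc, kp_src, kp_tgt):
    """Train the discriminator to label source keypoints 1 and target keypoints 0."""
    logit_s, logit_t = disc(kp_src.detach()), disc(kp_tgt.detach())
    return (F.binary_cross_entropy_with_logits(logit_s, torch.ones_like(logit_s)) +
            F.binary_cross_entropy_with_logits(logit_t, torch.zeros_like(logit_t)))


def domain_confusion_loss(disc, kp_src, kp_tgt):
    """Train the keypoint encoders with flipped labels so the discriminator cannot
    tell the two domains apart, pushing both representations toward a shared space."""
    logit_s, logit_t = disc(kp_src), disc(kp_tgt)
    return (F.binary_cross_entropy_with_logits(logit_s, torch.zeros_like(logit_s)) +
            F.binary_cross_entropy_with_logits(logit_t, torch.ones_like(logit_t)))


# Example wiring: keypoints come from per-domain encoders (not shown); here the
# target keypoints are mapped through the learned affine before the confusion term.
num_keypoints = 10
disc = KeypointDomainDiscriminator(num_keypoints)
affine = LearnedAffine()
kp_src = torch.rand(8, num_keypoints, 2)   # stand-ins for encoder outputs
kp_tgt = torch.rand(8, num_keypoints, 2)
loss_conf = domain_confusion_loss(disc, kp_src, affine(kp_tgt))
```

In this sketch the confusion term is a standard adversarial objective with flipped labels; the paper's actual keypoint extractor, decoder, and full set of loss terms are described in the text and released code linked above.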
Related papers
- Unsupervised Video Domain Adaptation for Action Recognition: A
Disentanglement Perspective [37.45565756522847]
We consider the generation of cross-domain videos from two sets of latent factors.
The TranSVAE framework is then developed to model such generation.
Experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE.
arXiv Detail & Related papers (2022-08-15T17:59:31Z)
- Guess What Moves: Unsupervised Video and Image Segmentation by
Anticipating Motion [92.80981308407098]
We propose an approach that combines the strengths of motion-based and appearance-based segmentation.
We propose to supervise an image segmentation network, tasking it with predicting regions that are likely to contain simple motion patterns.
In the unsupervised video segmentation mode, the network is trained on a collection of unlabelled videos, using the learning process itself as an algorithm to segment these videos.
arXiv Detail & Related papers (2022-05-16T17:55:34Z)
- The Right Spin: Learning Object Motion from Rotation-Compensated Flow
Fields [61.664963331203666]
How humans perceive moving objects is a longstanding research question in computer vision.
One approach to the problem is to teach a deep network to model all of these effects.
We present a novel probabilistic model to estimate the camera's rotation given the motion field.
arXiv Detail & Related papers (2022-02-28T22:05:09Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is only trained on a pair of original and processed videos directly instead of a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- Self-Supervised Keypoint Discovery in Behavioral Videos [37.367739727481016]
We propose a method for learning the posture and structure of agents from unlabelled behavioral videos.
Our method uses an encoder-decoder architecture with a geometric bottleneck to reconstruct the difference between video frames.
By focusing only on regions of movement, our approach works directly on input videos without requiring manual annotations.
arXiv Detail & Related papers (2021-12-09T18:55:53Z)
- Contrastive Learning of Image Representations with Cross-Video
Cycle-Consistency [13.19476138523546]
Cross-video relation has barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2021-05-13T17:59:11Z)
- On Development and Evaluation of Retargeting Human Motion and Appearance
in Monocular Videos [2.870762512009438]
Transferring human motion and appearance between videos of human actors remains one of the key challenges in Computer Vision.
We propose a novel and high-performant approach based on a hybrid image-based rendering technique that exhibits competitive visual quality.
We also present a new video benchmark dataset composed of different videos with annotated human motions to evaluate the task of synthesizing people's videos.
arXiv Detail & Related papers (2021-03-29T13:17:41Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation, which leverage adversarial learning to unify the source and target video representations, are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
- Cross-Identity Motion Transfer for Arbitrary Objects through
Pose-Attentive Video Reassembling [40.20163225821707]
Given a source image and a driving video, our networks animate the subject in the source image according to the motion in the driving video.
In our attention mechanism, dense similarities between the learned keypoints in the source and the driving images are computed.
To reduce the training-testing discrepancy of the self-supervised learning, a novel cross-identity training scheme is additionally introduced.
arXiv Detail & Related papers (2020-07-17T07:21:12Z)
- Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)