Abstract: We introduce a self-supervised motion-transfer VAE model to disentangle
motion and content from videos. Unlike previous work on content-motion
disentanglement in videos, we adopt a chunk-wise modeling approach and take
advantage of the motion information contained in spatiotemporal neighborhoods.
Our model yields per-chunk representations that can be modeled independently
and preserve temporal consistency. Hence, we can reconstruct whole videos in a
single forward pass. We extend the ELBO's log-likelihood term and include a
Blind Reenactment Loss as an inductive bias that encourages motion
disentanglement, under the assumption that swapping motion features between
two videos yields reenactment. We evaluate our model on recently proposed
disentanglement metrics and
show that it outperforms a variety of methods for video motion-content
disentanglement. Experiments on video reenactment demonstrate the
effectiveness of our disentanglement in the input space, where our model
surpasses the baselines in both reconstruction quality and motion alignment.