Unfolding Videos Dynamics via Taylor Expansion
- URL: http://arxiv.org/abs/2409.02371v2
- Date: Sat, 7 Sep 2024 16:15:11 GMT
- Title: Unfolding Videos Dynamics via Taylor Expansion
- Authors: Siyi Chen, Minkyu Choi, Zesen Zhao, Kuan Han, Qing Qu, Zhongming Liu,
- Abstract summary: We present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi)
ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence.
ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings.
- Score: 5.723852805622308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Taking inspiration from physical motion, we present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi). ViDiDi is a simple and data-efficient strategy, readily applicable to existing self-supervised video representation learning frameworks based on instance discrimination. At its core, ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence. These derivatives, along with the original frames, support the Taylor series expansion of the underlying continuous dynamics at discrete times, where higher-order derivatives emphasize higher-order motion features. ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings following a balanced alternating learning algorithm. By learning consistent representations for original frames and derivatives, the encoder is steered to emphasize motion features over static backgrounds and uncover the hidden dynamics in original frames. Hence, video representations are better separated by dynamic features. We integrate ViDiDi into existing instance discrimination frameworks (VICReg, BYOL, and SimCLR) for pretraining on UCF101 or Kinetics and test on standard benchmarks including video retrieval, action recognition, and action detection. The performances are enhanced by a significant margin without the need for large models or extensive datasets.
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms previous state-of-the-art by a large margin for downstream fine-grained action classification and even faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z) - Dynamic Appearance: A Video Representation for Action Recognition with
Joint Training [11.746833714322154]
We introduce a new concept, Dynamic Appearance (DA), summarizing the appearance information relating to movement in a video.
We consider distilling the dynamic appearance from raw video data as a means of efficient video understanding.
We provide extensive experimental results on four action recognition benchmarks.
arXiv Detail & Related papers (2022-11-23T07:16:16Z) - Modelling Latent Dynamics of StyleGAN using Neural ODEs [52.03496093312985]
We learn the trajectory of independently inverted latent codes from GANs.
The learned continuous trajectory allows us to perform infinite frame and consistent video manipulation.
Our method achieves state-of-the-art performance but with much less computation.
arXiv Detail & Related papers (2022-08-23T21:20:38Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - Generating Videos with Dynamics-aware Implicit Generative Adversarial
Networks [68.93429034530077]
We propose dynamics-aware implicit generative adversarial network (DIGAN) for video generation.
We show that DIGAN can be trained on 128 frame videos of 128x128 resolution, 80 frames longer than the 48 frames of the previous state-of-the-art method.
arXiv Detail & Related papers (2022-02-21T23:24:01Z) - Exploring Discontinuity for Video Frame Interpolation [7.061238509514182]
We propose three techniques to make the existing deep learning-based VFI architectures robust to discontinuous motions.
First is a novel data augmentation strategy called figure-text mixing (FTM) which can make the models learn discontinuous motions.
Second, we propose a simple but effective module that predicts a map called discontinuity map (D-map) which densely distinguishes between areas of continuous and discontinuous motions.
arXiv Detail & Related papers (2022-02-15T10:17:02Z) - Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints in arbitrary frame rates, and the second stage synthesizes videos based on the given keypoints sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL)
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter- snippet Temporal Contrastive Graphs (TCG)
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - Relational Self-Attention: What's Missing in Attention for Video
Understanding [52.38780998425556]
We introduce a relational feature transform, dubbed the relational self-attention (RSA)
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
arXiv Detail & Related papers (2021-11-02T15:36:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.