Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization
- URL: http://arxiv.org/abs/2303.11003v2
- Date: Thu, 28 Sep 2023 14:43:03 GMT
- Title: Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization
- Authors: Fida Mohammad Thoker, Hazel Doughty, Cees Snoek
- Abstract summary: We propose a self-supervised method for learning motion-focused video representations.
We learn similarities between videos with identical local motion dynamics but an otherwise different appearance.
Our approach maintains performance when using only 25% of the pretraining videos.
- Score: 23.245275661852446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a self-supervised method for learning motion-focused video
representations. Existing approaches minimize distances between temporally
augmented videos, which maintain high spatial similarity. We instead propose to
learn similarities between videos with identical local motion dynamics but an
otherwise different appearance. We do so by adding synthetic motion
trajectories to videos which we refer to as tubelets. By simulating different
tubelet motions and applying transformations, such as scaling and rotation, we
introduce motion patterns beyond what is present in the pretraining data. This
allows us to learn a video representation that is remarkably data efficient:
our approach maintains performance when using only 25% of the pretraining
videos. Experiments on 10 diverse downstream settings demonstrate our
competitive performance and generalizability to new domains and fine-grained
actions.
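The core recipe in the abstract (paste the same synthetic moving patch, a "tubelet", into two otherwise different clips, then contrast their embeddings) is concrete enough for a minimal sketch. The PyTorch-style helpers below are illustrative assumptions, not the authors' released code: the function names, the naive paste-and-scale compositing, and the InfoNCE form are ours, and rotation and other tubelet transforms are omitted for brevity.

```python
# Hedged sketch of tubelet-contrastive pretraining (assumed details, not the authors' code).
import torch
import torch.nn.functional as F

def paste_tubelet(video, patch, trajectory, scales):
    """Overlay one patch onto a clip along a synthetic trajectory.
    video: (T, C, H, W) float tensor, patch: (C, h, w), trajectory: one (x, y) per frame."""
    T, _, H, W = video.shape
    out = video.clone()
    for t, (x, y) in enumerate(trajectory):
        # Rescale the patch for frame t (rotation and other transforms omitted here).
        p = F.interpolate(patch.unsqueeze(0), scale_factor=float(scales[t]),
                          mode="bilinear", align_corners=False).squeeze(0)
        h, w = p.shape[1], p.shape[2]
        y0 = min(max(int(y), 0), H - h)
        x0 = min(max(int(x), 0), W - w)
        out[t, :, y0:y0 + h, x0:x0 + w] = p  # naive paste; blending is also possible
    return out

def tubelet_contrastive_loss(encoder, clip_a, clip_b, tau=0.07):
    """InfoNCE pulling together clips that share tubelet motion but differ in appearance.
    clip_a, clip_b: (B, T, C, H, W) batches forming positive pairs index-by-index."""
    z_a = F.normalize(encoder(clip_a), dim=1)
    z_b = F.normalize(encoder(clip_b), dim=1)
    logits = z_a @ z_b.t() / tau
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)
```

In this reading, data efficiency comes from the augmentation itself: because tubelet trajectories and scales are simulated, new motion patterns can be generated without new pretraining videos.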
Related papers
- Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics [67.97235923372035]
We present Puppet-Master, an interactive video generative model that can serve as a motion prior for part-level dynamics.
At test time, given a single image and a sparse set of motion trajectories, Puppet-Master can synthesize a video depicting realistic part-level motion faithful to the given drag interactions.
arXiv Detail & Related papers (2024-08-08T17:59:38Z)
- Controllable Longer Image Animation with Diffusion Models [12.565739255499594]
We introduce an open-domain controllable image animation method using motion priors with video diffusion models.
Our method achieves precise control over the direction and speed of motion in the movable region by extracting the motion field information from videos.
We propose an efficient long-duration video generation method based on noise reschedule specifically tailored for image animation tasks.
arXiv Detail & Related papers (2024-05-27T16:08:00Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments.
We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge.
Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
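As a rough illustration of the residual-vector idea in this summary, the sketch below treats frame-to-frame differences of latents as the motion reference; the exact objective, the latent shapes, and the function names are assumptions rather than the paper's formulation.

```python
# Hedged sketch of a residual-vector motion objective in the spirit of VMC (assumed form).
import torch
import torch.nn.functional as F

def frame_residuals(latents):
    """latents: (B, T, C, H, W) -> residual vectors between consecutive frames."""
    return latents[:, 1:] - latents[:, :-1]

def motion_distillation_loss(pred_latents, ref_latents):
    """Align predicted and reference frame residuals via cosine similarity."""
    pred_res = frame_residuals(pred_latents).flatten(2)
    ref_res = frame_residuals(ref_latents).flatten(2)
    return (1 - F.cosine_similarity(pred_res, ref_res, dim=-1)).mean()
```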
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models [24.282240656366714]
Motion Customization aims to adapt existing text-to-video diffusion models to generate videos with customized motion.
We propose MotionDirector, with a dual-path LoRAs architecture to decouple the learning of appearance and motion.
Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions.
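The dual-path LoRA idea can be sketched as a base layer with two separate low-rank adapters, one intended for appearance and one for motion, so the two can be trained or recombined independently; the class below is an assumed structure for illustration, not the paper's code.

```python
# Hypothetical dual-path LoRA layer (assumed structure, not MotionDirector's implementation).
import torch
import torch.nn as nn

class DualPathLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base  # frozen pretrained projection
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.appearance_down = nn.Linear(d_in, rank, bias=False)
        self.appearance_up = nn.Linear(rank, d_out, bias=False)
        self.motion_down = nn.Linear(d_in, rank, bias=False)
        self.motion_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.appearance_up.weight)  # adapters start as identity
        nn.init.zeros_(self.motion_up.weight)
        self.alpha = alpha

    def forward(self, x, use_appearance=True, use_motion=True):
        out = self.base(x)
        if use_appearance:
            out = out + self.alpha * self.appearance_up(self.appearance_down(x))
        if use_motion:
            out = out + self.alpha * self.motion_up(self.motion_down(x))
        return out
```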
arXiv Detail & Related papers (2023-10-12T16:26:18Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state-of-the-art by a large margin on downstream fine-grained action classification, while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces low-confidence outputs for randomly shuffled frames.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
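The two self-supervision signals described above can be illustrated with two small losses: one predicting whether frame order is intact, and one pushing action logits toward a uniform distribution for shuffled clips. These are assumed formulations for illustration, not the paper's exact losses.

```python
# Illustrative sketch of temporal-order and shuffled-frame uniformity objectives (assumed).
import torch
import torch.nn.functional as F

def temporal_order_loss(order_logits, is_ordered):
    """order_logits: (B, 2); is_ordered: (B,) long, 1 for original order, 0 for shuffled."""
    return F.cross_entropy(order_logits, is_ordered)

def shuffled_uniformity_loss(action_logits):
    """Penalize confident class predictions on temporally shuffled clips."""
    log_probs = F.log_softmax(action_logits, dim=1)
    uniform = torch.full_like(log_probs, 1.0 / action_logits.size(1))
    return F.kl_div(log_probs, uniform, reduction="batchmean")
```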
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Motion-Augmented Self-Training for Video Recognition at Smaller Scale [32.73585552425734]
We propose the first motion-augmented self-training regime, which we call MotionFit.
We generate pseudo-labels for a large unlabeled video collection, which enables us to transfer knowledge by learning to predict these pseudo-labels with an appearance model.
We obtain a strong motion-augmented representation model suited for video downstream tasks like action recognition and clip retrieval.
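A rough sketch of this pseudo-labeling pipeline is given below: a motion model labels unlabeled clips, and an appearance (RGB) model is trained to predict those labels. The pipeline shape and helper names are assumptions, not the authors' code.

```python
# Hedged sketch of motion-augmented self-training in the spirit of MotionFit (assumed pipeline).
import torch
import torch.nn.functional as F

@torch.no_grad()
def assign_pseudo_labels(motion_model, flow_clips):
    """Pseudo-labels from the motion model's most confident class per clip."""
    return motion_model(flow_clips).argmax(dim=1)

def self_training_step(appearance_model, rgb_clips, pseudo_labels, optimizer):
    """One knowledge-transfer step: fit the RGB model to motion pseudo-labels."""
    optimizer.zero_grad()
    loss = F.cross_entropy(appearance_model(rgb_clips), pseudo_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```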
arXiv Detail & Related papers (2021-05-04T17:43:19Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
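A minimal sketch of the two-speed idea, under assumed details (clip length, strides, and two separate encoders), is shown below: slow and fast clips sampled from the same unlabeled video form a positive pair for a contrastive loss.

```python
# Minimal sketch (assumed details) of contrasting the same video at two playback speeds.
import torch
import torch.nn.functional as F

def sample_two_speeds(video, clip_len=16, slow_stride=1, fast_stride=4):
    """video: (T, C, H, W) with T >= clip_len * fast_stride; return slow and fast clips."""
    slow = video[:clip_len * slow_stride:slow_stride]
    fast = video[:clip_len * fast_stride:fast_stride]
    return slow, fast

def two_speed_contrastive(encoder_slow, encoder_fast, slow_clips, fast_clips, tau=0.1):
    """InfoNCE between the two pathways; clips from the same video are positives."""
    z_s = F.normalize(encoder_slow(slow_clips), dim=1)
    z_f = F.normalize(encoder_fast(fast_clips), dim=1)
    logits = z_s @ z_f.t() / tau
    labels = torch.arange(z_s.size(0), device=z_s.device)
    return F.cross_entropy(logits, labels)
```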
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.