MV2MAE: Multi-View Video Masked Autoencoders
- URL: http://arxiv.org/abs/2401.15900v1
- Date: Mon, 29 Jan 2024 05:58:23 GMT
- Title: MV2MAE: Multi-View Video Masked Autoencoders
- Authors: Ketul Shah, Robert Crandall, Jie Xu, Peng Zhou, Marian George, Mayank
Bansal, Rama Chellappa
- Abstract summary: We present a method for self-supervised learning from synchronized multi-view videos.
We use a cross-view reconstruction task to inject geometry information into the model.
Our approach is based on the masked autoencoder (MAE) framework.
- Score: 33.61642891911761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos captured from multiple viewpoints can help in perceiving the 3D
structure of the world and benefit computer vision tasks such as action
recognition, tracking, etc. In this paper, we present a method for
self-supervised learning from synchronized multi-view videos. We use a
cross-view reconstruction task to inject geometry information into the model. Our
approach is based on the masked autoencoder (MAE) framework. In addition to the
same-view decoder, we introduce a separate cross-view decoder that leverages a
cross-attention mechanism to reconstruct a target-viewpoint video from a
source-viewpoint video, encouraging representations that are robust to viewpoint
changes. In videos, static regions can be reconstructed trivially, which hinders
the learning of meaningful representations. To tackle this, we introduce a
motion-weighted reconstruction loss that improves temporal modeling. We report
state-of-the-art results on the NTU-60, NTU-120 and ETRI datasets, as well as
in the transfer learning setting on NUCLA, PKU-MMD-II and ROCOG-v2 datasets,
demonstrating the robustness of our approach. Code will be made available.
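To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch based only on the abstract, not the authors' released code. The names (CrossViewDecoder, motion_weighted_loss), the single-layer decoder, the assumed tube/patch size, and the exact motion weighting are all illustrative assumptions.
```python
import torch
import torch.nn as nn

class CrossViewDecoder(nn.Module):
    """Sketch: reconstruct masked patches of a target view by cross-attending
    to encoder tokens from a synchronized source view (one transformer layer)."""
    def __init__(self, dim=512, heads=8, patch_pixels=3 * 16 * 16 * 2):  # tube size assumed
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.to_pixels = nn.Linear(dim, patch_pixels)

    def forward(self, target_queries, source_tokens):
        # target_queries: mask tokens + positional embeddings for the target view (B, N, dim)
        # source_tokens:  encoder output for visible source-view patches (B, M, dim)
        x = target_queries
        attn_out, _ = self.cross_attn(self.norm1(x), source_tokens, source_tokens)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return self.to_pixels(x)  # predicted pixels per target patch

def motion_weighted_loss(pred, target, prev_target, eps=1e-6):
    """Per-patch MSE weighted by temporal change, so trivially reconstructed
    static patches contribute less. This is one plausible reading of the
    'motion-weighted reconstruction loss'; the exact weighting is an assumption."""
    motion = (target - prev_target).abs().mean(dim=-1, keepdim=True)   # (B, N, 1)
    weights = motion / (motion.mean(dim=1, keepdim=True) + eps)        # normalize per sample
    return (weights * (pred - target) ** 2).mean()
```
In this reading, the cross-view decoder's queries are mask tokens for the target view, its keys and values come from the source-view encoder output, and the loss down-weights patches with little frame-to-frame change.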
Related papers
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - Bidirectional Cross-Modal Knowledge Exploration for Video Recognition
with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z) - MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z) - Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z) - Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z) - Self-Supervised Video Representation Learning with Motion-Contrastive
Perception [13.860736711747284]
The Motion-Contrastive Perception Network (MCPNet) consists of two branches, namely Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z) - Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder in which the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of these facets for model training, so the learned video representation is biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - Learning to Deblur and Rotate Motion-Blurred Faces [43.673660541417995]
We train a neural network to reconstruct a 3D video representation from a single image and the corresponding face gaze.
We then provide a camera viewpoint relative to the estimated gaze and the blurry image as input to an encoder-decoder network to generate a video of sharp frames with a novel camera viewpoint.
arXiv Detail & Related papers (2021-12-14T17:51:19Z) - Support-Set Based Cross-Supervision for Video Grounding [98.29089558426399]
The Support-set Based Cross-Supervision (Sscs) module can improve existing methods during the training phase without extra inference cost.
The proposed Sscs module contains two main components, i.e., a discriminative contrastive objective and a generative caption objective.
We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins.
arXiv Detail & Related papers (2021-08-24T08:25:26Z) - Cycle-Contrast for Self-Supervised Video Representation Learning [10.395615031496064]
We present Cycle-Contrastive Learning (CCL), a novel self-supervised method for learning video representation.
In our method, the frame and video representations are learned from a single network based on an R3D architecture.
We demonstrate that the video representation learned by CCL can be transferred well to downstream tasks of video understanding.
arXiv Detail & Related papers (2020-10-28T08:27:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.