Video 3D Sampling for Self-supervised Representation Learning
- URL: http://arxiv.org/abs/2107.03578v1
- Date: Thu, 8 Jul 2021 03:22:06 GMT
- Title: Video 3D Sampling for Self-supervised Representation Learning
- Authors: Wei Li, Dezhao Luo, Bo Fang, Yu Zhou, Weiping Wang
- Abstract summary: We propose a novel self-supervised method for video representation learning, referred to as Video 3D Sampling (V3S).
In our implementation, we combine sampling along the three dimensions and propose scale and projection transformations in space and time, respectively.
The experimental results show that, when applied to action recognition, video retrieval and action similarity labeling, our approach improves on the state of the art by significant margins.
- Score: 13.135859819622855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing video self-supervised methods mainly leverage the
temporal signals of videos, ignoring that the semantics of moving objects and
environmental information are also critical for video-related tasks. In this
paper, we propose a novel self-supervised method for video representation
learning, referred to as Video 3D Sampling (V3S). In order to sufficiently
utilize the spatial and temporal information provided in videos, we
pre-process a video along three dimensions (width, height, time). As a result,
we can leverage spatial information (the size of objects) and temporal
information (the direction and magnitude of motions) as our learning targets.
In our implementation, we combine sampling along the three dimensions and
propose scale and projection transformations in space and time, respectively.
The experimental results show that, when applied to action recognition, video
retrieval and action similarity labeling, our approach improves on the state
of the art by significant margins.
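The abstract describes the method only at a high level; the minimal sketch below illustrates one plausible reading of V3S-style 3D sampling, where a spatial scale and a temporal sampling rate are applied to a clip and their indices serve as the self-supervised targets. The function name `v3s_transform` and the parameter sets `SPATIAL_SCALES` and `TEMPORAL_STRIDES` are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

# Illustrative parameter sets (assumptions, not the authors' exact values).
SPATIAL_SCALES = [1.0, 0.75, 0.5]   # resize factors along width/height
TEMPORAL_STRIDES = [1, 2, 4]        # frame-sampling rates along time

def v3s_transform(clip, scale_idx, stride_idx, out_frames=16):
    """Apply a spatial scale and a temporal sampling to a (T, H, W, C) clip.

    Returns the transformed clip; the pair (scale_idx, stride_idx) would
    serve as the self-supervised classification target.
    """
    t, h, w, c = clip.shape
    s = SPATIAL_SCALES[scale_idx]
    nh, nw = int(h * s), int(w * s)
    # Nearest-neighbor resize via index sampling (keeps the sketch dependency-free).
    ys = (np.arange(nh) / s).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / s).astype(int).clip(0, w - 1)
    spatial = clip[:, ys][:, :, xs]
    # Temporal sampling with the chosen stride, wrapped to out_frames frames.
    idx = (np.arange(out_frames) * TEMPORAL_STRIDES[stride_idx]) % t
    return spatial[idx]

# Toy usage: random clip, randomly chosen transformation as the pretext label.
clip = np.random.rand(32, 112, 112, 3).astype(np.float32)
scale_idx = np.random.randint(len(SPATIAL_SCALES))
stride_idx = np.random.randint(len(TEMPORAL_STRIDES))
x = v3s_transform(clip, scale_idx, stride_idx)
print(x.shape, "labels:", scale_idx, stride_idx)
```

In the paper, a network would be trained to recognize which transformations were applied; here the label pair stands in for that classification target.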
Related papers
- Learning-based Multi-View Stereo: A Survey [55.3096230732874]
Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments.
With the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance compared with traditional approaches.
arXiv Detail & Related papers (2024-08-27T17:53:18Z)
- Flatten: Video Action Recognition is an Image Classification task [15.518011818978074]
A novel video representation architecture, Flatten, serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network.
Experiments on commonly used datasets have demonstrated that embedding Flatten provides significant performance improvements over the original models; a toy illustration of the flattening idea follows below.
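The summary does not spell out the flattening operation; the sketch below shows one simple interpretation (an assumption on our part, not the paper's actual module) that tiles the frames of a clip into a single image so a stock 2D classifier can consume a video.

```python
import numpy as np

def flatten_clip(clip, grid=(4, 4)):
    """Tile a (T, H, W, C) clip into one (gh*H, gw*W, C) image.

    A toy reading of 'flattening' video into an image-classification
    input; the paper's actual module may differ (assumption).
    """
    gh, gw = grid
    t, h, w, c = clip.shape
    assert t == gh * gw, "grid must cover all frames"
    rows = [np.concatenate(clip[r * gw:(r + 1) * gw], axis=1) for r in range(gh)]
    return np.concatenate(rows, axis=0)

clip = np.random.rand(16, 56, 56, 3).astype(np.float32)
image = flatten_clip(clip)   # shape (224, 224, 3): ready for a 2D CNN
print(image.shape)
```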
arXiv Detail & Related papers (2024-08-17T14:59:58Z)
- Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model [52.27297680947337]
Multimodal language models (MLLMs) are increasingly being deployed in real-world environments.
Despite their potential, current top models still fall short of adequately understanding spatial and temporal dimensions.
We introduce Coarse Correspondence, a training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding.
arXiv Detail & Related papers (2024-08-01T17:57:12Z)
- Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net [15.032516344808526]
We introduce the 3DST-UNet-RL framework for video summarization.
We show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks.
The proposed framework has the potential to save storage costs for ultrasound screening videos and to increase efficiency when browsing patient video data during retrospective analysis.
arXiv Detail & Related papers (2021-06-19T16:27:19Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Self-Supervised Multi-View Learning via Auto-Encoding 3D Transformations [61.870882736758624]
We propose a novel self-supervised paradigm to learn Multi-View Transformation Equivariant Representations (MV-TER).
Specifically, we perform a 3D transformation on a 3D object and obtain multiple views before and after the transformation via projection.
Then, we self-train a representation to capture the intrinsic 3D object structure by decoding the 3D transformation parameters from the fused features of the views before and after the transformation (a simplified sketch follows).
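As a deliberately simplified illustration of the recipe described above, the sketch below applies a random z-axis rotation to a point cloud and projects it to two orthographic views; in MV-TER, view features would be fused and a head would decode the transformation parameters. The helper names, the orthographic projection, and the two-view setup are assumptions for illustration.

```python
import numpy as np

def rotate_z(points, angle):
    """Rotate an (N, 3) point cloud about the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def project(points, axis=2):
    """Orthographic projection: drop one coordinate to obtain a 2D view."""
    keep = [i for i in range(3) if i != axis]
    return points[:, keep]

# Toy 3D object and a random transformation whose parameter is the target.
points = np.random.randn(1024, 3).astype(np.float32)
angle = np.random.uniform(0, 2 * np.pi)

views_before = [project(points, axis=a) for a in (0, 2)]
views_after = [project(rotate_z(points, angle), axis=a) for a in (0, 2)]
# In MV-TER, encoders embed each view, the features are fused, and a head
# decodes the transformation parameter (here, `angle`) as the training signal.
print(len(views_before), views_after[0].shape, "target angle:", round(angle, 3))
```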
arXiv Detail & Related papers (2021-03-01T06:24:17Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs; a toy sketch follows after this entry.
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
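To make the pretext task above concrete, here is a minimal sketch of one such summary statistic, computed from frame differences; the block-grid layout and the centroid-based direction estimate are our assumptions rather than the authors' full recipe.

```python
import numpy as np

def largest_motion_stats(clip, grid=4):
    """Toy spatio-temporal summary for a (T, H, W) grayscale clip.

    Returns (block_row, block_col, direction), where the block indexes the
    grid cell with the largest frame-to-frame change and direction is the
    motion angle estimated from the drift of the motion centroid.
    """
    total = np.abs(np.diff(clip, axis=0)).sum(axis=0)   # (H, W) total motion
    h, w = total.shape
    bh, bw = h // grid, w // grid
    blocks = total[:grid * bh, :grid * bw].reshape(grid, bh, grid, bw).sum(axis=(1, 3))
    row, col = np.unravel_index(np.argmax(blocks), blocks.shape)
    # Dominant direction: displacement of the motion centroid over time.
    per_frame = np.abs(np.diff(clip, axis=0))           # (T-1, H, W)
    ys, xs = np.mgrid[0:h, 0:w]
    mass = per_frame.sum(axis=(1, 2)) + 1e-8
    cy = (per_frame * ys).sum(axis=(1, 2)) / mass
    cx = (per_frame * xs).sum(axis=(1, 2)) / mass
    direction = np.arctan2(cy[-1] - cy[0], cx[-1] - cx[0])
    return int(row), int(col), float(direction)

clip = np.random.rand(8, 64, 64).astype(np.float32)
print(largest_motion_stats(clip))  # e.g. (2, 1, -0.7): pretext regression targets
```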
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.