Contrastive Masked Autoencoders for Self-Supervised Video Hashing
- URL: http://arxiv.org/abs/2211.11210v2
- Date: Wed, 23 Nov 2022 15:04:55 GMT
- Title: Contrastive Masked Autoencoders for Self-Supervised Video Hashing
- Authors: Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shutao Xia
- Abstract summary: Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
- Score: 54.636976693527636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Supervised Video Hashing (SSVH) models learn to generate short binary
representations for videos without ground-truth supervision, facilitating
large-scale video retrieval efficiency and attracting increasing research
attention. The success of SSVH lies in the understanding of video content and
the ability to capture the semantic relation among unlabeled videos. Typically,
state-of-the-art SSVH methods address these two points in a two-stage training
pipeline: they first train an auxiliary network with instance-wise
mask-and-predict tasks, and then train a hashing model to preserve the
pseudo-neighborhood structure transferred from the auxiliary network. This
consecutive training strategy is inflexible and unnecessary. In this
paper, we propose a simple yet effective one-stage SSVH method called ConMH,
which incorporates video semantic information and video similarity relationship
understanding in a single stage. To capture video semantic information for
better hash learning, we adopt an encoder-decoder structure to reconstruct the
video from its temporally masked frames. In particular, we find that a higher
masking ratio aids video understanding. In addition, we fully exploit the
similarity relationship between videos by maximizing agreement between two
augmented views of a video, which yields more discriminative and robust
hash codes. Extensive experiments on three large-scale video datasets (i.e.,
FCVID, ActivityNet and YFCC) indicate that ConMH achieves state-of-the-art
results. Code is available at https://github.com/huangmozhi9527/ConMH.
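To make the one-stage idea concrete, below is a minimal, hypothetical PyTorch sketch of this kind of objective: BERT-style temporal masking with a learned mask token (the paper's exact encoder-decoder masking scheme may differ), frame-feature reconstruction on masked positions, and a SimCLR-style NT-Xent loss that maximizes agreement between two masked views. The class name ToySSVH, the layer sizes, the masking style, and the loss weight lam are illustrative assumptions, not the released implementation (see the linked repository for that).

```python
# Illustrative sketch only: one-stage masked reconstruction + contrastive agreement.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySSVH(nn.Module):
    def __init__(self, feat_dim=1024, hash_bits=64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))       # learned mask token
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(feat_dim, feat_dim)                 # lightweight reconstruction head
        self.hash_head = nn.Linear(feat_dim, hash_bits)

    def mask_frames(self, frames, mask_ratio):
        """Replace a high ratio of frame features with the mask token."""
        n, t, d = frames.shape
        mask = torch.rand(n, t, device=frames.device) < mask_ratio  # True = masked
        masked = torch.where(mask.unsqueeze(-1), self.mask_token.expand(n, t, d), frames)
        return masked, mask

    def forward(self, frames, mask_ratio=0.75):
        masked, mask = self.mask_frames(frames, mask_ratio)
        z = self.encoder(masked)                                     # contextualize visible + masked slots
        recon = self.decoder(z)                                      # predict original frame features
        code = torch.tanh(self.hash_head(z.mean(dim=1)))             # relaxed video-level hash code
        return recon, code, mask


def nt_xent(a, b, temperature=0.5):
    """SimCLR-style NT-Xent loss between two batches of paired embeddings."""
    z = F.normalize(torch.cat([a, b], dim=0), dim=1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                  # drop self-similarity
    n = a.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(a.device)
    return F.cross_entropy(sim, targets)


def training_step(model, frames, lam=1.0):
    recon1, code1, mask1 = model(frames)                             # view 1: one random temporal mask
    recon2, code2, mask2 = model(frames)                             # view 2: an independent mask
    # Reconstruction is scored only on masked positions of each view.
    loss_rec = F.mse_loss(recon1[mask1], frames[mask1]) + F.mse_loss(recon2[mask2], frames[mask2])
    loss_con = nt_xent(code1, code2)                                 # maximize agreement between views
    return loss_rec + lam * loss_con


# Toy usage with random pre-extracted frame features (8 videos x 30 frames x 1024 dims).
model = ToySSVH(feat_dim=1024, hash_bits=64)
frames = torch.randn(8, 30, 1024)
loss = training_step(model, frames)
loss.backward()
```

At retrieval time, binary codes would typically be obtained by taking the sign of the tanh outputs; the tanh relaxation is only there to keep training differentiable.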
Related papers
- Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video
Retrieval [67.52910255064762]
We first design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics into binary codes.
arXiv Detail & Related papers (2023-10-12T03:21:12Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
- Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove in order to reduce the input video to a target length.
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
- Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise masking strategy where we mask neighboring video tokens in both the spatial and temporal domains (an illustrative sketch of block-wise masking follows this list).
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
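For the block-wise masking mentioned in the VIMPAC entry, here is a small illustrative sketch (in PyTorch) that hides contiguous spatio-temporal blocks of video tokens instead of independent tokens. The block size, target ratio, and function name are assumptions for illustration, not the paper's settings.

```python
# Illustrative sketch only: block-wise spatio-temporal masking of a video token grid.
import torch


def blockwise_mask(t, h, w, target_ratio=0.5, block=(2, 4, 4)):
    """Return a (t, h, w) boolean mask; True marks tokens hidden as part of a 3D block."""
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    bt, bh, bw = block
    while mask.float().mean().item() < target_ratio:
        # Sample the front/top-left corner of a block, clamped so the block stays inside the grid.
        t0 = torch.randint(0, max(1, t - bt + 1), (1,)).item()
        h0 = torch.randint(0, max(1, h - bh + 1), (1,)).item()
        w0 = torch.randint(0, max(1, w - bw + 1), (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True
    return mask


# Example: mask roughly half of an 8x16x16 token grid with 2x4x4 blocks.
m = blockwise_mask(8, 16, 16)
print("masked fraction:", m.float().mean().item())
```

Masking whole blocks makes the prediction task harder than masking independent tokens, since neighboring tokens can no longer be trivially copied from their unmasked neighbors.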