3D-CSL: self-supervised 3D context similarity learning for
Near-Duplicate Video Retrieval
- URL: http://arxiv.org/abs/2211.05352v1
- Date: Thu, 10 Nov 2022 05:51:08 GMT
- Title: 3D-CSL: self-supervised 3D context similarity learning for
Near-Duplicate Video Retrieval
- Authors: Rui Deng, Qian Wu, Yuke Li
- Abstract summary: We introduce 3D-CSL, a compact pipeline for Near-Duplicate Video Retrieval (NDVR).
We propose a two-stage self-supervised similarity learning strategy to optimize the network.
Our method achieves the state-of-the-art performance on clip-level NDVR.
- Score: 17.69904571043164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce 3D-CSL, a compact pipeline for Near-Duplicate
Video Retrieval (NDVR), and explore a novel self-supervised learning strategy
for video similarity learning. Most previous methods only extract spatial
features from individual frames and then design various complex mechanisms to
learn the temporal correlations among frame features, but by that point some
spatiotemporal dependencies have already been lost. To address this, 3D-CSL
extracts global spatiotemporal dependencies from videos end-to-end with a 3D
transformer and finds a good balance between efficiency and effectiveness by
matching at the clip level. Furthermore, we propose a two-stage self-supervised
similarity learning strategy to optimize the entire network. Firstly, we
propose PredMAE to pretrain the 3D transformer with a video prediction task;
secondly, ShotMix, a novel video-specific augmentation, and FCS loss, a novel
triplet loss, are proposed to further promote the similarity learning results. The
experiments on FIVR-200K and CC_WEB_VIDEO demonstrate the superiority and
reliability of our method, which achieves the state-of-the-art performance on
clip-level NDVR.
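To make the clip-level matching concrete, here is a minimal sketch in PyTorch: clips are embedded by a 3D encoder (a hypothetical `clip_encoder` standing in for the paper's 3D transformer), compared via a cosine-similarity matrix, and trained with a plain triplet margin loss. The actual FCS loss, PredMAE pretraining, and ShotMix augmentation are not reproduced here.

```python
# Minimal sketch (not the authors' released code): clip-level matching with a
# generic triplet margin loss. `clip_encoder` is a hypothetical placeholder for
# the 3D transformer; FCS loss, PredMAE, and ShotMix are NOT reproduced here.
import torch
import torch.nn.functional as F

def clip_similarity(query_clips, candidate_clips, clip_encoder):
    """Embed clips and compare at clip level.

    query_clips:     (Nq, C, T, H, W) tensor of query video clips
    candidate_clips: (Nc, C, T, H, W) tensor of candidate video clips
    Returns an (Nq, Nc) cosine-similarity matrix.
    """
    q = F.normalize(clip_encoder(query_clips), dim=-1)      # (Nq, D)
    c = F.normalize(clip_encoder(candidate_clips), dim=-1)   # (Nc, D)
    return q @ c.t()

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Plain triplet margin loss on L2-normalized clip embeddings
    (a generic stand-in for the FCS loss described in the abstract)."""
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    pos_sim = (a * p).sum(dim=-1)
    neg_sim = (a * n).sum(dim=-1)
    return F.relu(neg_sim - pos_sim + margin).mean()
```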
Related papers
- RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks [93.18404922542702]
We present a novel video generative model designed to address long-term spatial and temporal dependencies.
Our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks.
Our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps.
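As a rough illustration of how a tri-plane representation can be queried (an assumption about the general technique, not the RAVEN implementation), a spatiotemporal point (x, y, t) is bilinearly sampled from three 2D feature planes (xy, xt, yt) and the features are summed:

```python
# Minimal sketch of a tri-plane video representation (an assumption of how the
# hybrid explicit-implicit encoding might look; not the RAVEN implementation).
import torch
import torch.nn.functional as F

def query_triplane(planes, coords):
    """planes: dict of three feature maps, each (1, C, S, S), keyed 'xy', 'xt', 'yt'.
    coords: (N, 3) points with (x, y, t) in [-1, 1].
    Returns (N, C) features = sum of bilinear samples from the three planes."""
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    pairs = {'xy': (x, y), 'xt': (x, t), 'yt': (y, t)}
    feats = 0
    for key, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)        # (1, N, 1, 2)
        sampled = F.grid_sample(planes[key], grid, align_corners=True)
        feats = feats + sampled.view(planes[key].shape[1], -1).t()  # (N, C)
    return feats
```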
arXiv Detail & Related papers (2024-01-11T16:48:44Z)
- Semi-supervised 3D Video Information Retrieval with Deep Neural Network and Bi-directional Dynamic-time Warping Algorithm [14.39527406033429]
The proposed algorithm is designed to handle large video datasets and retrieve the videos most related to a given query video clip.
We split both the candidate and the query videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network.
We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method.
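A minimal sketch of the matching step, assuming standard dynamic time warping over clip embeddings (the paper's exact bi-directional variant is not reproduced; here the forward and reversed alignment costs are simply averaged):

```python
# Minimal sketch (an assumption, not the paper's code): DTW alignment cost
# between two sequences of clip embeddings, run forward and backward and averaged
# as a simple stand-in for the "bi-directional" variant described above.
import numpy as np

def dtw_cost(a, b):
    """a: (m, d), b: (n, d) arrays of L2-normalized clip embeddings.
    Returns the accumulated cosine-distance alignment cost."""
    m, n = len(a), len(b)
    dist = 1.0 - a @ b.T                       # (m, n) cosine distances
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[m, n]

def bidirectional_dtw(a, b):
    # Average the forward pass and the pass over time-reversed sequences.
    return 0.5 * (dtw_cost(a, b) + dtw_cost(a[::-1], b[::-1]))
```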
arXiv Detail & Related papers (2023-09-03T03:10:18Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
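A minimal, hypothetical sketch of such a consistency term (not the paper's loss): if the augmentation shifts the video in time by a known fraction, the boundary predicted on the augmented clip should match the correspondingly shifted prediction on the original clip.

```python
# Hypothetical consistency term, for illustration only: boundary predictions on a
# temporally shifted clip should equal the shifted predictions on the original.
import torch
import torch.nn.functional as F

def consistency_loss(pred_orig, pred_aug, shift_frac):
    """pred_orig, pred_aug: (B, 2) normalized (start, end) predictions in [0, 1].
    shift_frac: (B,) fraction of the clip by which the augmentation shifted the video."""
    expected = (pred_orig + shift_frac.unsqueeze(-1)).clamp(0.0, 1.0)
    return F.smooth_l1_loss(pred_aug, expected)
```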
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Optimization Planning for 3D ConvNets [123.43419144051703]
It is not trivial to optimally train 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and the many options of the training scheme.
We decompose the training path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state.
We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., the optimization path.
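The planning step could be sketched as a small bitmask dynamic program over candidate states, assuming a hypothetical `gain(i, j)` estimate of how much running state j right after state i helps (an illustration of the search, not the authors' cost model):

```python
# Minimal sketch (an assumption about the planning step, not the authors' code):
# bitmask dynamic programming that orders a small set of candidate training
# "states" to maximize a hypothetical estimated gain between consecutive states.

def plan_states(states, gain):
    """states: list of hyper-parameter dicts (e.g. learning rate, clip length).
    gain(i, j): estimated benefit of running state j right after state i
                (gain(None, j) scores j as the first state). Returns best order."""
    n = len(states)
    best = {}   # (visited_mask, last_state) -> (total_gain, order)
    for j in range(n):
        best[(1 << j, j)] = (gain(None, j), [j])
    for mask in range(1, 1 << n):
        for last in range(n):
            if (mask, last) not in best:
                continue
            total, order = best[(mask, last)]
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                cand = (total + gain(last, nxt), order + [nxt])
                key = (mask | (1 << nxt), nxt)
                if key not in best or cand[0] > best[key][0]:
                    best[key] = cand
    full = (1 << n) - 1
    return max(best[(full, j)] for j in range(n) if (full, j) in best)[1]
```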
arXiv Detail & Related papers (2022-01-11T16:13:31Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
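A minimal sketch of such an objective, assuming an InfoNCE-style loss that matches each short clip's embedding to the embedding of a longer temporal extent from the same video, with other videos in the batch as negatives (not the authors' implementation):

```python
# Minimal sketch (an assumption): InfoNCE between short-clip embeddings and
# embeddings of longer temporal contexts from the same videos.
import torch
import torch.nn.functional as F

def long_short_nce(short_emb, long_emb, temperature=0.07):
    """short_emb, long_emb: (B, D) embeddings of short clips and their longer contexts."""
    s = F.normalize(short_emb, dim=-1)
    l = F.normalize(long_emb, dim=-1)
    logits = s @ l.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)    # match each short clip to its own context
```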
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
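A minimal sketch of the idea, assuming the "memory" is simply the mean of per-clip features from K sampled clips (the paper's collaborative mechanism is more involved; names here are hypothetical):

```python
# Hypothetical sketch, not the paper's mechanism: pool features from several
# sampled clips of the same video and classify from the aggregated representation.
import torch
import torch.nn as nn

class ClipMemoryClassifier(nn.Module):
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone               # any clip encoder: (B, C, T, H, W) -> (B, D)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clips):
        """clips: (B, K, C, T, H, W) with K sampled clips per video."""
        b, k = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).view(b, k, -1)
        memory = feats.mean(dim=1)             # aggregate the per-clip features
        return self.head(memory)               # video-level prediction
```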
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network.
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
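One way such conditional computation can look in code, as a hedged sketch rather than the Ada3D implementation: a per-instance gate produced by a policy network decides whether a temporal 3D convolution is applied or skipped through a residual bypass.

```python
# Hypothetical sketch of instance-conditional 3D computation (not Ada3D itself):
# a gate from a policy network switches a temporal convolution on or off per input.
import torch
import torch.nn as nn

class GatedTemporalBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.temporal_conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                       padding=(1, 0, 0))

    def forward(self, x, use_3d):
        """x: (B, C, T, H, W); use_3d: (B,) float gate in {0, 1} from the policy net."""
        gate = use_3d.view(-1, 1, 1, 1, 1)
        return x + gate * self.temporal_conv(x)   # skip the temporal conv when the gate is 0
```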
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Learnable Sampling 3D Convolution for Video Enhancement and Action Recognition [24.220358793070965]
We introduce a new module, LS3D-Conv, to improve the capability of 3D convolution.
We add learnable 2D offsets to 3D convolution, aiming to sample locations on the spatial feature maps across frames.
The experiments on video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
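A rough sketch of the idea, assuming the offsets are predicted per frame and applied by resampling the spatial feature maps before a standard 3D convolution (hypothetical module, not the LS3D-Conv code):

```python
# Hypothetical sketch: per-frame 2D offset fields warp the spatial feature maps,
# then a standard 3D convolution runs on the resampled volume.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampled3DConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.offset_pred = nn.Conv2d(channels, 2, kernel_size=3, padding=1)
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        """x: (B, C, T, H, W)"""
        b, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        offsets = self.offset_pred(frames).permute(0, 2, 3, 1)      # (B*T, H, W, 2)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device),
                                indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b * t, h, w, 2)  # identity grid
        warped = F.grid_sample(frames, base + offsets, align_corners=True)
        warped = warped.view(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        return self.conv3d(warped)
```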
arXiv Detail & Related papers (2020-11-22T09:20:49Z)