Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video
Retrieval
- URL: http://arxiv.org/abs/2310.08009v1
- Date: Thu, 12 Oct 2023 03:21:12 GMT
- Title: Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video
Retrieval
- Authors: Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, Yongdong
Zhang
- Abstract summary: We design a simple dual-stream structure, including a temporal layer and a hash layer.
We first design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics into binary codes.
- Score: 67.52910255064762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised video hashing usually optimizes binary codes by learning to
reconstruct input videos. Such reconstruction constraint spends much effort on
frame-level temporal context changes without focusing on video-level global
semantics that are more useful for retrieval. Hence, we address this problem by
decomposing video information into reconstruction-dependent and
semantic-dependent information, which disentangles the semantic extraction from
reconstruction constraint. Specifically, we first design a simple dual-stream
structure, including a temporal layer and a hash layer. Then, with the help of
semantic similarity knowledge obtained from self-supervision, the hash layer
learns to capture information for semantic retrieval, while the temporal layer
learns to capture the information for reconstruction. In this way, the model
naturally preserves the disentangled semantics into binary codes. Validated by
comprehensive experiments, our method consistently outperforms the
state-of-the-arts on three video benchmarks.
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video paragraph Grounding (WSVPG) to eliminate the need of temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z) - CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved
Self-Supervised Video Hashing [45.216750448864275]
Learn accurate hash for video retrieval can be challenging due to high local redundancy and complex global video frames.
Our proposed Contrastive Hash-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets.
arXiv Detail & Related papers (2023-10-29T07:36:11Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - SELF-VS: Self-supervised Encoding Learning For Video Summarization [6.21295508577576]
We propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder.
Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification.
arXiv Detail & Related papers (2023-03-28T14:08:05Z) - Making Reconstruction-based Method Great Again for Video Anomaly
Detection [64.19326819088563]
Anomaly detection in videos is a significant yet challenging problem.
Existing reconstruction-based methods rely on old-fashioned convolutional autoencoders.
We propose a new autoencoder model for enhanced consecutive frame reconstruction.
arXiv Detail & Related papers (2023-01-28T01:57:57Z) - An Image captioning algorithm based on the Hybrid Deep Learning
Technique (CNN+GRU) [0.0]
We present a CNN-GRU encoder decode framework for caption-to-image reconstructor.
It handles the semantic context into consideration as well as the time complexity.
The suggested model outperforms the state-of-the-art LSTM-A5 model for picture captioning in terms of time complexity and accuracy.
arXiv Detail & Related papers (2023-01-06T10:00:06Z) - Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z) - Context Recovery and Knowledge Retrieval: A Novel Two-Stream Framework
for Video Anomaly Detection [48.05512963355003]
We propose a two-stream framework based on context recovery and knowledge retrieval.
For the context recovery stream, we propose a U-Net which can fully utilize the motion information to predict the future frame.
For the knowledge retrieval stream, we propose an improved learnable locality-sensitive hashing.
The knowledge about normality is encoded and stored in hash tables, and the distance between the testing event and the knowledge representation is used to reveal the probability of anomaly.
arXiv Detail & Related papers (2022-09-07T03:12:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.