Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video
Retrieval
- URL: http://arxiv.org/abs/2310.08009v1
- Date: Thu, 12 Oct 2023 03:21:12 GMT
- Title: Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video
Retrieval
- Authors: Pandeng Li, Hongtao Xie, Jiannan Ge, Lei Zhang, Shaobo Min, Yongdong
Zhang
- Abstract summary: We first design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics in binary codes.
- Score: 67.52910255064762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised video hashing usually optimizes binary codes by learning to
reconstruct input videos. Such a reconstruction constraint spends much effort on
frame-level temporal context changes without focusing on video-level global
semantics that are more useful for retrieval. Hence, we address this problem by
decomposing video information into reconstruction-dependent and
semantic-dependent information, which disentangles semantic extraction from the
reconstruction constraint. Specifically, we first design a simple dual-stream
structure, including a temporal layer and a hash layer. Then, with the help of
semantic similarity knowledge obtained from self-supervision, the hash layer
learns to capture information for semantic retrieval, while the temporal layer
learns to capture the information for reconstruction. In this way, the model
naturally preserves the disentangled semantics in binary codes. Validated by
comprehensive experiments, our method consistently outperforms state-of-the-art
methods on three video benchmarks.
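The dual-stream idea above is concrete enough to sketch in code. Below is a minimal PyTorch illustration, not the authors' released implementation: the GRU temporal layer, linear hash layer, feature dimensions, and straight-through binarization are all assumptions, and the paper's semantic-similarity objective is left as a noted placeholder.

```python
# Minimal sketch of a dual-stream (temporal + hash) video hashing model.
# Illustrative reimplementation only; all dimensions, layer choices, and
# loss terms below are assumptions, not the paper's actual code.
import torch
import torch.nn as nn

class DualStreamHash(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, code_bits=64):
        super().__init__()
        # Shared frame-feature transform feeding both streams.
        self.backbone = nn.Linear(feat_dim, hidden_dim)
        # Temporal stream: models frame-level context for reconstruction.
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, feat_dim)
        # Hash stream: distills video-level semantics into binary codes.
        self.hash_layer = nn.Linear(hidden_dim, code_bits)

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        h = torch.relu(self.backbone(frames))
        # Reconstruction-dependent information.
        ctx, _ = self.temporal(h)
        recon = self.decoder(ctx)
        # Semantic-dependent information: mean-pool frames, then binarize.
        soft = torch.tanh(self.hash_layer(h.mean(dim=1)))
        # Straight-through sign: binary forward pass, smooth backward pass.
        code = torch.sign(soft).detach() + soft - soft.detach()
        return recon, code

model = DualStreamHash()
frames = torch.randn(4, 16, 2048)        # e.g. 16 pre-extracted frame features
recon, code = model(frames)
recon_loss = nn.functional.mse_loss(recon, frames)
# The semantic-similarity term from self-supervision is paper-specific;
# a pairwise objective over `code` would go here.
print(recon.shape, code.shape)           # (4, 16, 2048) (4, 64)
```

The key design point the abstract argues for is visible in the sketch: the reconstruction loss only touches the temporal stream, so the hash stream is free to encode retrieval-relevant semantics.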
Related papers
- ε-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
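The denoising-as-decoding idea can be illustrated with a toy training step: the encoder produces a latent, and a denoiser learns to remove noise from a corrupted image conditioned on that latent. Everything below (networks, the linear noise schedule, shapes) is an assumption for illustration, not the ε-VAE architecture.

```python
# Toy sketch of "denoising as decoding": the decoder is replaced by a
# denoiser that refines noise into the image, conditioned on the
# encoder's latent. Not the ε-VAE implementation.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, dim=3 * 32 * 32, latent=128):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(dim, latent))
    def forward(self, x):
        return self.net(x)

class LatentConditionedDenoiser(nn.Module):
    """Predicts the noise in x_t given the timestep and the latent."""
    def __init__(self, dim=3 * 32 * 32, latent=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + latent + 1, 512),
                                 nn.ReLU(), nn.Linear(512, dim))
    def forward(self, x_t, t, z):
        inp = torch.cat([x_t.flatten(1), z, t[:, None].float()], dim=1)
        return self.net(inp).view_as(x_t)

enc, den = Encoder(), LatentConditionedDenoiser()
x = torch.rand(8, 3, 32, 32)
z = enc(x)
# One DDPM-style training step: corrupt at a random timestep, regress the
# noise, so the denoiser learns to invert corruption given z.
T = 100
t = torch.randint(0, T, (8,))
alpha = (1.0 - t.float() / T).view(-1, 1, 1, 1)   # toy linear schedule
noise = torch.randn_like(x)
x_t = alpha.sqrt() * x + (1 - alpha).sqrt() * noise
loss = nn.functional.mse_loss(den(x_t, t, z), noise)
loss.backward()
```

At inference, decoding would run this denoiser iteratively from pure noise, always conditioned on the same latent, which is the "iterative refinement" the summary describes.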
- CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing [45.216750448864275]
Learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames.
Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) method outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets.
arXiv Detail & Related papers (2023-10-29T07:36:11Z)
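The summary names a contrastive hashing objective without detailing it; for orientation, here is a textbook NT-Xent loss applied to tanh-relaxed hash codes of two augmented views of each video. This is a generic sketch, not CHAIN's actual loss or architecture.

```python
# Generic contrastive objective on hash-code proxies, of the kind used in
# contrastive video hashing. Temperature and code width are assumptions.
import torch
import torch.nn.functional as F

def ntxent(h1, h2, tau=0.2):
    """h1, h2: (batch, bits) tanh-relaxed codes of two views per video."""
    z = F.normalize(torch.cat([h1, h2], dim=0), dim=1)   # (2B, bits)
    sim = z @ z.t() / tau                                # cosine similarities
    n = h1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))           # drop self-pairs
    # The positive for sample i is its other view, offset by n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

h1 = torch.tanh(torch.randn(16, 64))   # codes from view 1
h2 = torch.tanh(torch.randn(16, 64))   # codes from view 2
print(ntxent(h1, h2).item())
```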
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
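The coupling of a convolutional branch (local spatial cues) with a transformer branch (global temporal cues) can be shown in a minimal dual-branch sketch; the layer sizes and fusion-by-concatenation below are assumptions, not the DCCT design.

```python
# Minimal dual-branch sketch: CNN branch + transformer branch over a
# frame sequence, fused into one embedding per tracklet. Illustrative
# only; not the DCCT architecture.
import torch
import torch.nn as nn

class ConvTransformerBranches(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # CNN branch over per-frame features (1-D convolution for brevity).
        self.conv = nn.Sequential(nn.Conv1d(feat_dim, feat_dim, 3, padding=1),
                                  nn.ReLU())
        # Transformer branch over the same frame sequence.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, x):                   # x: (batch, frames, feat_dim)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)
        global_ = self.transformer(x)
        # Complementary fusion: concatenate the two streams, then pool
        # over frames into one re-ID embedding.
        fused = self.fuse(torch.cat([local, global_], dim=-1))
        return fused.mean(dim=1)

emb = ConvTransformerBranches()(torch.randn(4, 8, 512))
print(emb.shape)                            # (4, 512)
```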
- SELF-VS: Self-supervised Encoding Learning For Video Summarization [6.21295508577576]
We propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder.
Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification.
arXiv Detail & Related papers (2023-03-28T14:08:05Z)
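The distillation setup described above can be sketched directly: a transformer student predicts per-frame importance, builds an importance-weighted video vector, and is pulled toward a frozen teacher embedding. The module choices, the cosine distillation loss, and the random stand-in teacher below are assumptions, not the SELF-VS code.

```python
# Sketch of distilling a CNN teacher's video embedding into a transformer
# student whose pooling weights double as frame-importance scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentSummarizer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.importance = nn.Linear(dim, 1)   # per-frame importance score

    def forward(self, frames):               # (batch, time, dim)
        h = self.encoder(frames)
        w = torch.softmax(self.importance(h).squeeze(-1), dim=1)
        # Importance-weighted pooling gives the semantic video vector.
        return (w.unsqueeze(-1) * h).sum(dim=1), w

student = StudentSummarizer()
frames = torch.randn(4, 20, 512)
with torch.no_grad():
    teacher_emb = torch.randn(4, 512)  # stand-in for a frozen classification CNN
student_emb, scores = student(frames)
# Distillation loss: pull the student's weighted representation toward
# the teacher's (cosine here; the paper's exact choice may differ).
loss = 1 - F.cosine_similarity(student_emb, teacher_emb).mean()
loss.backward()
print(scores.shape)  # (4, 20) importance scores usable for summarization
```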
- An Image Captioning Algorithm Based on the Hybrid Deep Learning Technique (CNN+GRU) [0.0]
We present a CNN-GRU encoder-decoder framework for image captioning that takes semantic context into consideration as well as time complexity.
The suggested model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.
arXiv Detail & Related papers (2023-01-06T10:00:06Z)
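A CNN-encoder / GRU-decoder captioner is a standard pattern; the bare-bones sketch below shows the shape of such a model. The vocabulary size, dimensions, and tiny CNN are assumptions for illustration, not the paper's network.

```python
# Bare-bones CNN-encoder / GRU-decoder captioning sketch (teacher forcing).
import torch
import torch.nn as nn

class CaptionGRU(nn.Module):
    def __init__(self, vocab=1000, dim=256):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1),
                                 nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(), nn.Linear(16, dim))
        self.embed = nn.Embedding(vocab, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, image, tokens):
        # The image feature initializes the GRU hidden state; `tokens`
        # are the teacher-forced previous words.
        h0 = self.cnn(image).unsqueeze(0)            # (1, batch, dim)
        seq, _ = self.gru(self.embed(tokens), h0)
        return self.out(seq)                         # (batch, len, vocab)

model = CaptionGRU()
image = torch.rand(2, 3, 64, 64)
tokens = torch.randint(0, 1000, (2, 7))              # previous-word inputs
logits = model(image, tokens)
print(logits.shape)                                  # (2, 7, 1000)
```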
- Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z)
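A contrastive masked autoencoder for hashing combines two signals: reconstructing masked frames (semantic information) and pulling codes of two views of the same video together (similarity understanding). The sketch below shows that combination; masking ratio, networks, and loss weighting are assumptions, not the ConMH implementation.

```python
# Sketch of masked-frame reconstruction plus a contrastive agreement term
# for video hashing. Illustrative only; not the ConMH code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedHashEncoder(nn.Module):
    def __init__(self, dim=512, bits=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)       # reconstructs frame features
        self.hash = nn.Linear(dim, bits)

    def forward(self, frames, mask_ratio=0.5):
        # Randomly zero out a fraction of frames, encode the rest.
        mask = torch.rand(frames.shape[:2], device=frames.device) < mask_ratio
        visible = frames.masked_fill(mask.unsqueeze(-1), 0.0)
        h = self.encoder(visible)
        recon = self.decoder(h)
        code = torch.tanh(self.hash(h.mean(dim=1)))
        # Reconstructing masked positions forces semantic understanding.
        loss_rec = F.mse_loss(recon[mask], frames[mask])
        return code, loss_rec

model = MaskedHashEncoder()
frames = torch.randn(8, 16, 512)
c1, rec1 = model(frames)                # two stochastic masked views
c2, rec2 = model(frames)
sim = F.cosine_similarity(c1, c2).mean()
loss = rec1 + rec2 + (1 - sim)          # reconstruction + agreement terms
loss.backward()
```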
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
arXiv Detail & Related papers (2022-02-06T16:29:15Z)
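The mixed-coding idea, a conventional codec carrying a heavily compressed stream with a small network restoring task-relevant detail on top, can be shown conceptually. In the sketch below the "codec" is a toy down/up-sampling stand-in and the semantic-preservation loss is a plain-MSE placeholder; both are assumptions, not the paper's pipeline.

```python
# Conceptual traditional-neural mixed pipeline: toy codec + NN enhancer.
import torch
import torch.nn as nn
import torch.nn.functional as F

def toy_codec(frames, scale=4):
    """Stand-in for a traditional codec: aggressive down/up-sampling."""
    b, t, c, h, w = frames.shape
    x = frames.reshape(b * t, c, h, w)
    x = F.interpolate(F.interpolate(x, scale_factor=1 / scale), size=(h, w))
    return x.view(b, t, c, h, w)

class Enhancer(nn.Module):
    """NN stage that restores detail on top of the codec output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))
    def forward(self, x):
        b, t, c, h, w = x.shape
        y = self.net(x.reshape(b * t, c, h, w)).view(b, t, c, h, w)
        return x + y                       # residual enhancement

frames = torch.rand(2, 4, 3, 64, 64)       # (batch, time, C, H, W)
decoded = toy_codec(frames)                # low-bitrate reconstruction
enhanced = Enhancer()(decoded)
# A semantic-preservation objective would compare features of `enhanced`
# and `frames` under a frozen task network; plain MSE stands in here.
loss = F.mse_loss(enhanced, frames)
loss.backward()
```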