Efficient Self-Supervised Video Hashing with Selective State Spaces
- URL: http://arxiv.org/abs/2412.14518v1
- Date: Thu, 19 Dec 2024 04:33:22 GMT
- Title: Efficient Self-Supervised Video Hashing with Selective State Spaces
- Authors: Jinpeng Wang, Niu Lian, Jun Li, Yuting Wang, Yan Feng, Bin Chen, Yongbing Zhang, Shu-Tao Xia
- Abstract summary: Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval.
We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm.
- Score: 63.83300352372051
- Abstract: Self-supervised video hashing (SSVH) is a practical task in video indexing and retrieval. Although Transformers are predominant in SSVH for their impressive temporal modeling capabilities, they often suffer from computational and memory inefficiencies. Drawing inspiration from Mamba, an advanced state-space model, we explore its potential in SSVH to achieve a better balance between efficacy and efficiency. We introduce S5VH, a Mamba-based video hashing model with an improved self-supervised learning paradigm. Specifically, we design bidirectional Mamba layers for both the encoder and decoder, which are effective and efficient in capturing temporal relationships thanks to the data-dependent selective scanning mechanism with linear complexity. In our learning strategy, we transform global semantics in the feature space into semantically consistent and discriminative hash centers, followed by a center alignment loss as a global learning signal. Our self-local-global (SLG) paradigm significantly improves learning efficiency, leading to faster and better convergence. Extensive experiments demonstrate S5VH's improvements over state-of-the-art methods, superior transferability, and scalable advantages in inference efficiency. Code is available at https://github.com/gimpong/AAAI25-S5VH.
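As a rough illustration of the two ideas in the abstract, the sketch below pairs a bidirectional Mamba layer (forward and backward selective scans over the frame sequence) with a center-alignment loss that pulls relaxed hash codes toward semantic hash centers. It assumes the `mamba_ssm` package; the class and function names (BiMambaLayer, center_alignment_loss) are illustrative and are not taken from the authors' repository at https://github.com/gimpong/AAAI25-S5VH.

```python
# Minimal sketch of the ideas described in the abstract, NOT the official S5VH code.
# Assumes the `mamba_ssm` package (pip install mamba-ssm); names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from mamba_ssm import Mamba


class BiMambaLayer(nn.Module):
    """Bidirectional Mamba layer: scan the frame sequence forward and backward,
    then fuse the two directions, keeping linear complexity in sequence length."""

    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = Mamba(d_model=d_model)   # forward selective scan
        self.bwd = Mamba(d_model=d_model)   # backward selective scan
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                   # x: (batch, num_frames, d_model)
        h_fwd = self.fwd(x)
        h_bwd = self.bwd(torch.flip(x, dims=[1])).flip(dims=[1])
        return self.norm(x + h_fwd + h_bwd)  # residual fusion of both directions


def center_alignment_loss(codes, labels, centers):
    """Global learning signal (sketch): pull each relaxed hash code toward the
    hash center assigned to its semantic cluster.
    codes:   (batch, n_bits) tanh-relaxed codes in [-1, 1]
    labels:  (batch,) cluster assignments derived from global feature-space semantics
    centers: (n_clusters, n_bits) hash centers in {-1, +1}
    """
    target = centers[labels]                 # (batch, n_bits)
    return F.mse_loss(codes, target.float())
```

Stacking such layers in both the encoder and decoder keeps the cost linear in the number of frames, which is the efficiency argument the abstract makes against attention-based SSVH models.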
Related papers
- ContextFormer: Redefining Efficiency in Semantic Segmentation [46.06496660333768]
Convolutional methods, although capturing local dependencies well, struggle with long-range relationships.
Vision Transformers (ViTs) excel in global context capture but are hindered by high computational demands.
We propose ContextFormer, a hybrid framework leveraging the strengths of CNNs and ViTs in the bottleneck to balance efficiency, accuracy, and robustness for real-time semantic segmentation.
arXiv Detail & Related papers (2025-01-31T16:11:04Z)
- MobileMamba: Lightweight Multi-Receptive Visual Mamba Network [51.33486891724516]
Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs.
We propose the MobileMamba framework, which balances efficiency and performance.
MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods.
arXiv Detail & Related papers (2024-11-24T18:01:05Z)
- Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation [22.653014803666668]
We propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively unifying cross-model voxel features.
We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer.
We evaluated the framework on benchmark datasets including nuScenes, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over the current SoTA methods.
arXiv Detail & Related papers (2024-09-17T09:30:43Z)
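The summary above centers on distilling a transformer teacher into a low-FLOP Mamba student. A minimal, hypothetical feature-distillation loss of that flavor might look as follows; the projection layer and the cosine objective are illustrative and not FASD's actual adaptive voxel-feature alignment.

```python
# Sketch of transformer-to-Mamba feature distillation in the spirit of FASD;
# `distill_loss` and `proj` are hypothetical names, not the paper's API.
import torch
import torch.nn.functional as F


def distill_loss(student_feats, teacher_feats, proj):
    """Match the Mamba student's features to the frozen transformer teacher's.
    student_feats: (N, d_s) features from the low-FLOP Mamba detector
    teacher_feats: (N, d_t) features from the high-capacity transformer teacher
    proj:          nn.Linear(d_s, d_t) projecting the student into teacher space
    """
    with torch.no_grad():
        target = F.normalize(teacher_feats, dim=-1)
    pred = F.normalize(proj(student_feats), dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()   # cosine-distance loss
```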
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked with fine-grained video representations.
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
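The retrieval phase described above is a standard retrieve-then-rerank pipeline; a compact sketch is given below, with placeholder names (`retrieve`, `fine_scorer`) rather than the paper's API.

```python
# Sketch of the two-stage coarse-to-fine retrieval idea; names are placeholders.
import torch


def retrieve(text_emb, coarse_video_embs, fine_scorer, k=50):
    """Stage 1: fast recall of top-k candidates with coarse (global) embeddings.
    Stage 2: rerank only those k candidates with a fine-grained scorer.
    text_emb:          (d,) query embedding
    coarse_video_embs: (num_videos, d) precomputed coarse video embeddings
    fine_scorer:       callable(text_emb, video_index) -> float fine-grained score
    """
    coarse_scores = coarse_video_embs @ text_emb             # (num_videos,)
    topk = torch.topk(coarse_scores, k).indices               # cheap recall
    fine_scores = torch.tensor([fine_scorer(text_emb, int(i)) for i in topk])
    order = torch.argsort(fine_scores, descending=True)
    return topk[order]                                         # reranked candidate ids
```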
- Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory [64.11870454160614]
We propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM).
ADA-CM has two operating modes; the first works in a training-free paradigm, requiring no new learnable parameters.
Our proposed method achieves results competitive with the state of the art on the HICO-DET and V-COCO datasets with much less training time.
arXiv Detail & Related papers (2023-09-07T13:10:06Z)
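One way to read the training-free mode is as a similarity-weighted lookup into a fixed concept memory; the sketch below is purely hypothetical and only meant to make that reading concrete.

```python
# Hypothetical sketch of a training-free, memory-based mode in the spirit of
# ADA-CM: features are matched against a fixed memory of concept embeddings,
# so no new parameters need to be learned. All names are illustrative.
import torch
import torch.nn.functional as F


def memory_classify(query_feat, concept_memory, concept_labels, temperature=0.1):
    """query_feat:     (d,) feature of a detected human-object pair
    concept_memory: (num_concepts, d) cached concept embeddings (no gradients)
    concept_labels: (num_concepts, num_interactions) soft labels per concept
    Returns interaction scores as a similarity-weighted vote over the memory."""
    sims = F.normalize(concept_memory, dim=-1) @ F.normalize(query_feat, dim=-1)
    weights = torch.softmax(sims / temperature, dim=0)         # (num_concepts,)
    return weights @ concept_labels                            # (num_interactions,)
```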
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates both video semantic information and an understanding of video similarity relationships.
arXiv Detail & Related papers (2022-11-21T06:48:14Z)
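A one-stage objective combining the two ingredients mentioned above (semantic reconstruction plus similarity understanding) could be sketched as a masked-reconstruction term plus a contrastive term between two masked views; this is an illustration, not ConMH's released code.

```python
# Hypothetical one-stage loss combining frame reconstruction with a contrastive
# term between two masked views of the same video; names are illustrative.
import torch
import torch.nn.functional as F


def conmh_style_loss(recon, target, z1, z2, temperature=0.07):
    """recon, target: (batch, num_frames, d) reconstructed vs. original frame features
    z1, z2:        (batch, d_hash) relaxed hash codes from two masked views
    """
    recon_loss = F.mse_loss(recon, target)                  # video semantics
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (batch, batch)
    labels = torch.arange(z1.size(0), device=z1.device)     # positives on the diagonal
    contrast_loss = F.cross_entropy(logits, labels)         # video similarity
    return recon_loss + contrast_loss
```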
- HiCo: Hierarchical Contrastive Learning for Ultrasound Video Model Pretraining [22.85475242323536]
Self-supervised ultrasound (US) video model pretraining can achieve some of the most promising results on US diagnosis using only a small amount of labeled data.
This work proposes a hierarchical contrastive learning (HiCo) method to improve the transferability of US video model pretraining.
arXiv Detail & Related papers (2022-10-10T08:07:17Z)
- Large Scale Time-Series Representation Learning via Simultaneous Low and High Frequency Feature Bootstrapping [7.0064929761691745]
We propose a non-contrastive self-supervised learning approach that efficiently captures low- and high-frequency time-varying features.
Our method takes raw time series data as input and creates two different augmented views for the two branches of the model.
To demonstrate the robustness of our model, we performed extensive experiments and ablation studies on five real-world time-series datasets.
arXiv Detail & Related papers (2022-04-24T14:39:47Z)
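The two augmented views can be pictured as a low-frequency view (smoothed signal) and a high-frequency view (residual); the sketch below assumes that simple decomposition and a BYOL-style bootstrapping loss, which may differ from the paper's exact augmentations.

```python
# Sketch of frequency-oriented views plus a non-contrastive bootstrapping loss;
# the smoothing/residual split is an assumed stand-in for the paper's augmentations.
import torch
import torch.nn.functional as F


def make_views(x, kernel=9):
    """x: (batch, channels, length) raw time series.
    Returns a low-frequency view (moving average) and a high-frequency view
    (residual after smoothing), one per branch of the model."""
    weight = torch.ones(x.size(1), 1, kernel, device=x.device) / kernel
    low = F.conv1d(x, weight, padding=kernel // 2, groups=x.size(1))
    high = x - low
    return low, high


def bootstrap_loss(p_online, z_target):
    """Non-contrastive (BYOL-style) loss: online prediction vs. stop-gradient target."""
    p = F.normalize(p_online, dim=-1)
    z = F.normalize(z_target.detach(), dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()
```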
- ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search [94.90294600817215]
We propose a novel neural architecture search (NAS) method, termed ViPNAS, to search networks in both spatial and temporal levels for fast online video pose estimation.
In the spatial level, we carefully design the search space with five different dimensions, including network depth, width, kernel size, group number, and attention.
In the temporal level, we search over a series of temporal feature fusions to optimize the total accuracy and speed across multiple video frames.
arXiv Detail & Related papers (2021-05-21T06:36:40Z)
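The five spatial dimensions and the temporal fusion search listed above can be written down as a small search-space configuration; the candidate value sets below are illustrative examples, not ViPNAS's exact choices.

```python
# Illustrative encoding of the ViPNAS-style search space; value sets are assumptions.
import random

SPATIAL_SEARCH_SPACE = {
    "depth":       [2, 3, 4],            # blocks per stage
    "width":       [0.5, 0.75, 1.0],     # channel-width multipliers
    "kernel_size": [3, 5, 7],
    "groups":      [1, 2, 4],            # group-convolution factor
    "attention":   [False, True],        # whether to insert an attention module
}

# Temporal level: candidate ways of fusing features across frames (illustrative).
TEMPORAL_SEARCH_SPACE = {
    "fusion": ["none", "avg", "conv3d", "deformable"],
}


def sample_architecture(rng: random.Random):
    """Draw one candidate network configuration from the search space."""
    choice = {k: rng.choice(v) for k, v in SPATIAL_SEARCH_SPACE.items()}
    choice["fusion"] = rng.choice(TEMPORAL_SEARCH_SPACE["fusion"])
    return choice
```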
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.