CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved
Self-Supervised Video Hashing
- URL: http://arxiv.org/abs/2310.18926v1
- Date: Sun, 29 Oct 2023 07:36:11 GMT
- Title: CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved
Self-Supervised Video Hashing
- Authors: Rukai Wei, Yu Liu, Jingkuan Song, Heng Cui, Yanzhao Xie, and Ke Zhou
- Abstract summary: Learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between video frames.
Our proposed Contrastive Hashing with Global-Local Spatio-temporal Information (CHAIN) outperforms state-of-the-art self-supervised video hashing methods on four video benchmark datasets.
- Score: 45.216750448864275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compressing videos into binary codes can improve retrieval speed and reduce
storage overhead. However, learning accurate hash codes for video retrieval can
be challenging due to high local redundancy and complex global dependencies
between video frames, especially in the absence of labels. Existing
self-supervised video hashing methods have been effective in designing
expressive temporal encoders, but have not fully utilized the temporal dynamics
and spatial appearance of videos because their learning tasks are insufficiently
challenging and unreliable. To address these challenges, we begin by utilizing the
contrastive learning task to capture global spatio-temporal information of
videos for hashing. With the aid of our designed augmentation strategies, which
focus on spatial and temporal variations to create positive pairs, the learning
framework can generate hash codes that are invariant to motion, scale, and
viewpoint. Furthermore, we incorporate two collaborative learning tasks, i.e.,
frame order verification and scene change regularization, to capture local
spatio-temporal details within video frames, thereby enhancing the perception
of temporal structure and the modeling of spatio-temporal relationships. Our
proposed Contrastive Hashing with Global-Local Spatio-temporal Information
(CHAIN) outperforms state-of-the-art self-supervised video hashing methods on
four video benchmark datasets. Our codes will be released.
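As a rough illustration of the contrastive objective described above, the following PyTorch sketch applies an NT-Xent-style loss to relaxed (tanh) hash codes computed from two augmented views of the same video. All names here (contrastive_hash_loss, encoder, the temperature value) are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def contrastive_hash_loss(z1, z2, temperature=0.1):
    # z1, z2: (B, K) real-valued hash-layer outputs (e.g., tanh
    # activations) for two augmented views of the same videos;
    # at retrieval time they would be binarized with sign().
    b = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, K)
    sim = z @ z.t() / temperature                       # cosine similarities
    sim.fill_diagonal_(float('-inf'))                   # mask self-pairs
    # The positive for row i is its other view at index (i + B) mod 2B.
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

# Hypothetical usage: v1 and v2 are spatially/temporally augmented clips
# of the same videos, and encoder maps a clip to K hash logits.
# loss = contrastive_hash_loss(encoder(v1), encoder(v2))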
Related papers
- Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval [67.52910255064762]
We design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics into binary codes.
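Read literally, the dual-stream structure described in this entry (a temporal layer feeding a hash layer) could be sketched as follows; the GRU encoder, layer sizes, and tanh/sign relaxation are assumptions for illustration, not the paper's actual design.

import torch
import torch.nn as nn

class DualStreamHasher(nn.Module):
    # Illustrative temporal layer + hash layer; sizes are invented.
    def __init__(self, feat_dim=2048, hidden=512, code_bits=64):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, hidden, batch_first=True)
        self.hash = nn.Linear(hidden, code_bits)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame features.
        _, h = self.temporal(frame_feats)              # h: (1, B, hidden)
        relaxed = torch.tanh(self.hash(h.squeeze(0)))  # differentiable codes
        binary = torch.sign(relaxed.detach())          # +/-1 codes for retrieval
        return relaxed, binary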
arXiv Detail & Related papers (2023-10-12T03:21:12Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces randomly shuffled frames to produce low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
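The two signals described in this entry (verifying frame order, and pushing shuffled clips toward low-confidence predictions) might be expressed roughly as below; the loss weights and head shapes are hypothetical, not taken from the paper.

import torch
import torch.nn.functional as F

def order_verification_loss(order_logits, is_ordered):
    # Binary task: was the clip temporally ordered (1) or shuffled (0)?
    # order_logits: (B, 2); is_ordered: (B,) long tensor in {0, 1}.
    return F.cross_entropy(order_logits, is_ordered)

def shuffled_confidence_loss(logits_shuffled):
    # Push predictions on frame-shuffled clips toward the uniform
    # distribution, i.e., toward low confidence.
    log_probs = F.log_softmax(logits_shuffled, dim=1)
    uniform = torch.full_like(log_probs, 1.0 / log_probs.size(1))
    return F.kl_div(log_probs, uniform, reduction='batchmean')

# Hypothetical combination with assumed weights:
# loss = task_loss + 0.5 * order_verification_loss(o, y) \
#        + 0.5 * shuffled_confidence_loss(s)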
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Controllable Augmentations for Video Representation Learning [34.79719112810065]
We propose a framework that jointly utilizes local clips and global videos to learn from detailed region-level correspondence as well as general long-term temporal relations.
Our framework is superior on three video benchmarks in action recognition and video retrieval, capturing more accurate temporal dynamics.
arXiv Detail & Related papers (2022-03-30T19:34:32Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate temporal information across frames and reduces the need for high-complexity networks.
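One plausible reading of "interleaved local and global modules" inside a recurrent cell is a convolutional (local) update modulated by a pooled (global) channel gate; the speculative sketch below is only one such reading, not the paper's actual cell.

import torch
import torch.nn as nn

class LocalGlobalCell(nn.Module):
    # Speculative recurrent cell: a local conv over [input, state]
    # followed by a global, pooled channel gate. Sizes are invented.
    def __init__(self, channels=64):
        super().__init__()
        self.local = nn.Conv2d(channels * 2, channels, 3, padding=1)
        self.gate = nn.Linear(channels, channels)

    def forward(self, x, h):
        # x, h: (B, C, H, W) current-frame features and previous state.
        local = torch.relu(self.local(torch.cat([x, h], dim=1)))
        ctx = local.mean(dim=(2, 3))               # global average pooling
        g = torch.sigmoid(self.gate(ctx))          # per-channel gate
        return local * g[:, :, None, None]         # next hidden state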
arXiv Detail & Related papers (2020-12-24T00:03:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.