Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
- URL: http://arxiv.org/abs/2202.03384v2
- Date: Thu, 10 Feb 2022 01:30:08 GMT
- Title: Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval
- Authors: Jinpeng Wang, Bin Chen, Dongliang Liao, Ziyun Zeng, Gongfu Li, Shu-Tao Xia, Jin Xu
- Abstract summary: We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ).
HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos.
Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
- Score: 55.088635195893325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the recent boom of video-based social platforms (e.g., YouTube and
TikTok), video retrieval using sentence queries has become an important demand
and attracts increasing research attention. Despite the decent performance,
existing text-video retrieval models in vision and language communities are
impractical for large-scale Web search because they adopt brute-force search
based on high-dimensional embeddings. To improve efficiency, Web search engines
widely apply vector compression libraries (e.g., FAISS) to post-process the
learned embeddings. Unfortunately, separating compression from feature encoding
degrades the robustness of representations and incurs performance decay. To
pursue a better balance between performance and efficiency, we propose the
first quantized representation learning method for cross-view video retrieval,
namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both
coarse-grained and fine-grained quantizations with transformers, which provide
complementary understandings for texts and videos and preserve comprehensive
semantic information. By performing Asymmetric-Quantized Contrastive Learning
(AQ-CL) across views, HCQ aligns texts and videos at coarse-grained and
multiple fine-grained levels. This hybrid-grained learning strategy serves as
strong supervision on the cross-view video quantization model, where
contrastive learning at different levels can be mutually promoted. Extensive
experiments on three Web video benchmark datasets demonstrate that HCQ achieves
competitive performance with state-of-the-art non-compressed retrieval methods
while showing high efficiency in storage and computation. Code and
configurations are available at https://github.com/gimpong/WWW22-HCQ.
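The asymmetric idea in the abstract (continuous queries scored against quantized database items, trained with a contrastive objective) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the hard nearest-codeword assignment, the random codebooks, and the temperature of 0.07 are all assumptions made for illustration. A trainable model would presumably need a differentiable quantizer in place of the hard assignment.

```python
import numpy as np

def quantize(x, codebooks):
    """Hard product quantization (illustrative): x (n, M*d) -> codes (n, M), reconstruction (n, M*d)."""
    M, K, d = codebooks.shape
    parts = x.reshape(len(x), M, d)
    # Squared distance from each sub-vector to every codeword: (n, M, K)
    dists = ((parts[:, :, None, :] - codebooks[None]) ** 2).sum(-1)
    codes = dists.argmin(-1)
    recon = codebooks[np.arange(M), codes]  # (n, M, d)
    return codes, recon.reshape(len(x), M * d)

def asymmetric_scores(queries, codes, codebooks):
    """Inner products between raw queries and quantized items via lookup tables."""
    M, K, d = codebooks.shape
    q = queries.reshape(len(queries), M, d)
    # table[b, m, k] = <sub-vector m of query b, codeword k of codebook m>
    table = np.einsum("bmd,mkd->bmk", q, codebooks)
    # Pick the table entries selected by each item's codes, then sum over subspaces
    return table[:, np.arange(M), codes].sum(-1)  # (n_queries, n_items)

def info_nce(scores, temperature=0.07):
    """Contrastive loss where item i is the positive for query i (temperature is an assumed value)."""
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
M, K, d = 4, 8, 16                                         # hypothetical sizes
codebooks = rng.normal(size=(M, K, d))                     # random stand-in for learned codebooks
videos = rng.normal(size=(5, M * d))                       # stand-in video embeddings
texts = videos + 0.1 * rng.normal(size=videos.shape)       # stand-in paired text embeddings

codes, recon = quantize(videos, codebooks)
scores = asymmetric_scores(texts, codes, codebooks)
loss = info_nce(scores)
```

Each item is stored as M small integers instead of M*d floats, and scoring a query needs one lookup table per subspace rather than a full dot product per item. That is the general storage and computation trade-off the abstract refers to.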
Related papers
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations.
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
- Contrastive Masked Autoencoders for Self-Supervised Video Hashing [54.636976693527636]
Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision.
We propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding.
arXiv Detail & Related papers (2022-11-21T06:48:14Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable to that of baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Self-supervised Video Retrieval Transformer Network [10.456881328982586]
We propose SVRTN, which applies self-supervised training to learn video representation from unlabeled data.
It exploits transformer structure to aggregate frame-level features into clip-level to reduce both storage space and search complexity.
It can learn complementary and discriminative information from the interactions among clip frames, and it becomes invariant to frame permutation and missing frames, which supports more flexible retrieval modes.
arXiv Detail & Related papers (2021-04-16T09:43:45Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)