Related papers: T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval

URL: http://arxiv.org/abs/2408.11432v1
Date: Wed, 21 Aug 2024 08:40:45 GMT
Title: T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
Authors: Yili Li, Jing Yu, Keke Gai, Bang Liu, Gang Xiong, Qi Wu
Abstract summary: We introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers. T2VIndexer aims to reduce retrieval time while maintaining high accuracy.
Score: 30.48217069475297
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This approach matches every candidate video against the query, so it incurs a significant time cost that grows notably as the number of candidates increases. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://github.com/Lilidamowang/T2VIndexer-generativeSearch.
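
Illustrative sketch (not the authors' implementation): the abstract describes retrieval as generating a video identifier token by token instead of scoring every candidate. One common way to make such generation land on a valid identifier is to constrain decoding with a prefix trie. The example below assumes fixed-length integer identifiers, and uses greedy decoding with a hypothetical `toy_score` function standing in for the sequence-to-sequence model's next-token score. The names `build_trie`, `constrained_greedy_decode`, and `toy_score` are invented for this illustration, and the paper's identifier encoding and query-identifier augmentation are not reproduced here.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Trie = Dict[int, dict]


def build_trie(identifiers: Iterable[Sequence[int]]) -> Trie:
    """Build a prefix trie over all valid video identifiers."""
    root: Trie = {}
    for ident in identifiers:
        node = root
        for tok in ident:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    trie: Trie,
    score_fn: Callable[[List[int], int], float],
    length: int,
) -> Tuple[int, ...]:
    """Greedily pick the best-scoring token among those the trie allows.

    The number of decoding steps is bounded by the identifier length,
    not by the number of videos in the corpus.
    """
    prefix: List[int] = []
    node = trie
    for _ in range(length):
        allowed = sorted(node)  # tokens that keep the prefix valid
        if not allowed:
            break
        best = max(allowed, key=lambda tok: score_fn(prefix, tok))
        prefix.append(best)
        node = node[best]
    return tuple(prefix)


# Toy corpus: identifier -> video name (assumed data, for illustration only).
videos = {
    (0, 1, 2): "video_a",
    (0, 1, 3): "video_b",
    (2, 0, 1): "video_c",
}


def toy_score(prefix: List[int], tok: int) -> float:
    """Hypothetical stand-in for a model: prefers the identifier (0, 1, 3)."""
    preferred = (0, 1, 3)
    return 1.0 if preferred[len(prefix)] == tok else 0.0


trie = build_trie(videos)
decoded = constrained_greedy_decode(trie, toy_score, length=3)
retrieved = videos[decoded]
print(decoded, retrieved)
```

In practice a generative retriever would typically replace greedy selection with beam search so that several identifiers are returned per query; greedy decoding is used here only to keep the sketch short.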

Related papers

Re-thinking Temporal Search for Long-Form Video Understanding [67.12801626407135]
Current temporal search methods only achieve 2.1% temporal F1 score on the Longvideobench subset. Inspired by visual search in images, we propose a lightweight temporal search framework, T*, that reframes costly temporal search as spatial search. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding.
arXiv Detail & Related papers (2025-04-03T04:03:10Z)
TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding [24.52604124233087]
Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM.
arXiv Detail & Related papers (2025-04-02T06:47:19Z)
Enhancing Subsequent Video Retrieval via Vision-Language Models (VLMs) [3.783822944546971]
Vision-language models (VLMs) excel in representation learning, but struggle with adaptive, time-sensitive video retrieval. This paper introduces a novel framework that combines vector similarity search with graph-based data structures.
arXiv Detail & Related papers (2025-03-21T01:11:14Z)
Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval [56.05621657583251]
Cross-modal (e.g. image-text, video-text) retrieval is an important task in the fields of information retrieval and multimodal vision-language understanding. We introduce RTime, a novel temporal-emphasized video-text retrieval dataset. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours.
arXiv Detail & Related papers (2024-12-26T11:32:00Z)
GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video. By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions. GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query. We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task. For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities. For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications. Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos by cross-modalities. We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)
Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset. Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval [55.088635195893325]
We propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings for texts and videos. Experiments on three Web video benchmark datasets demonstrate that HCQ achieves competitive performance with state-of-the-art non-compressed retrieval methods.
arXiv Detail & Related papers (2022-02-07T18:04:10Z)
Multi-query Video Retrieval [44.32936301162444]
We focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive. We propose several new methods for leveraging multiple queries at training time to improve over simply combining similarity outputs of multiple queries. We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications.
arXiv Detail & Related papers (2022-01-10T20:44:46Z)
CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval [24.649068267308913]
Video retrieval applications should enable users to retrieve a precise moment from a large video corpus. We propose a novel model for effective moment localization and ranking. We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos.
arXiv Detail & Related papers (2021-09-21T08:07:27Z)
Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query. We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR. Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval [12.17757623963458]
This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval. We present the first hashing framework that infers the unseen future content of a currently playing video. Our approach also yields a significant mAP@20 performance increase compared to a baseline adapted from the literature for this task.
arXiv Detail & Related papers (2020-09-30T13:25:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.