Multi-query Video Retrieval
- URL: http://arxiv.org/abs/2201.03639v1
- Date: Mon, 10 Jan 2022 20:44:46 GMT
- Title: Multi-query Video Retrieval
- Authors: Zeyu Wang, Yu Wu, Karthik Narasimhan, Olga Russakovsky
- Abstract summary: We focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive.
We propose several new methods for leveraging multiple queries at training time to improve over simply combining similarity outputs of multiple queries.
We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications.
- Score: 44.32936301162444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieving target videos based on text descriptions is a task of great
practical value and has received increasing attention over the past few years.
In this paper, we focus on the less-studied setting of multi-query video
retrieval, where multiple queries are provided to the model for searching over
the video archive. We first show that the multi-query retrieval task is more
pragmatic and representative of real-world use cases and better evaluates
retrieval capabilities of current models, thereby deserving of further
investigation alongside the more prevalent single-query retrieval setup. We
then propose several new methods for leveraging multiple queries at training
time to improve over simply combining similarity outputs of multiple queries
from regular single-query trained models. Our models consistently outperform
several competitive baselines over three different datasets. For instance,
Recall@1 can be improved by 4.7 points on MSR-VTT, 4.1 points on MSVD and 11.7
points on VATEX over a strong baseline built on the state-of-the-art CLIP4Clip
model. We believe further modeling efforts will bring new insights to this
direction and spark new systems that perform better in real-world video
retrieval applications. Code is available at
https://github.com/princetonvisualai/MQVR.
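To make the baseline mentioned above concrete, here is a minimal sketch of "simply combining similarity outputs of multiple queries" from a single-query trained model: each query is embedded, cosine similarities against all video embeddings are computed, the per-query scores are averaged (late fusion), and videos are ranked by the fused score. The embeddings below are random placeholders standing in for real encoder outputs (e.g. from a CLIP4Clip-style model); the paper's proposed training-time methods are not reproduced here.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def multi_query_retrieval(query_embs, video_embs):
    """Rank videos given several text queries describing the same target.

    query_embs: (Q, D) array of query embeddings from a text encoder.
    video_embs: (N, D) array of video embeddings from a video encoder.
    Returns video indices sorted from best to worst fused score.
    """
    q = l2_normalize(np.asarray(query_embs, dtype=np.float64))
    v = l2_normalize(np.asarray(video_embs, dtype=np.float64))
    sims = q @ v.T              # (Q, N) per-query cosine similarities
    fused = sims.mean(axis=0)   # late fusion: average scores across queries
    return np.argsort(-fused)

# Toy usage: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
videos = rng.normal(size=(1000, 512))
queries = videos[42] + 0.5 * rng.normal(size=(3, 512))  # three noisy queries for video 42
ranking = multi_query_retrieval(queries, videos)
print("rank of target video:", int(np.where(ranking == 42)[0][0]) + 1)
```

Averaging is only the simplest fusion rule; the paper's contribution is to go beyond this by exploiting the multiple queries already at training time.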
Related papers
- T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval [30.48217069475297]
We introduce a model-based video indexer named T2VIndexer, a sequence-to-sequence generative model that directly generates video identifiers.
T2VIndexer aims to reduce retrieval time while maintaining high accuracy.
arXiv Detail & Related papers (2024-08-21T08:40:45Z)
- Many-Shot In-Context Learning in Multimodal Foundation Models [4.772535803521769]
Large language models are effective at few-shot in-context learning (ICL).
Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows.
We benchmark GPT-4o and Gemini 1.5 Pro across 14 datasets spanning multiple domains.
arXiv Detail & Related papers (2024-05-16T04:02:43Z)
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content.
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- DeepQAMVS: Query-Aware Hierarchical Pointer Networks for Multi-Video Summarization [127.16984421969529]
We introduce a novel Query-Aware Hierarchical Pointer Network for Multi-Video Summarization, termed DeepQAMVS.
DeepQAMVS is trained with reinforcement learning, incorporating rewards that capture representativeness, diversity, query-adaptability, and temporal coherence (a toy sketch of combining such reward terms appears after this list).
We achieve state-of-the-art results on the MVS1K dataset, with inference time scaling linearly with the number of input video frames.
arXiv Detail & Related papers (2021-05-13T17:33:26Z)
- MDMMT: Multidomain Multimodal Transformer for Video Retrieval [63.872634680339644]
We present a new state of the art on text-to-video retrieval on the MSRVTT and LSMDC benchmarks.
We show that training on multiple datasets can improve test results on each of them.
arXiv Detail & Related papers (2021-03-19T09:16:39Z)
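As a companion to the DeepQAMVS entry above, the sketch below shows one simple way the four named reward terms could be combined into a single scalar reward for a reinforcement-learning summarizer. All four term definitions are simplified, hypothetical stand-ins rather than the paper's formulations; only the weighted-combination structure is taken from the abstract.

```python
import numpy as np

def summary_reward(frame_feats, selected_idx, query_feat,
                   weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted combination of the four reward terms named in the abstract.

    frame_feats: (T, D) features of all candidate frames across the videos.
    selected_idx: 1-D array of indices of frames chosen for the summary.
    query_feat: (D,) feature of the text query.
    Each term below is an illustrative stand-in, not the paper's definition.
    """
    all_f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sel = all_f[selected_idx]
    q = query_feat / np.linalg.norm(query_feat)

    # Representativeness: how well the selected frames cover the whole pool.
    representativeness = float((all_f @ sel.T).max(axis=1).mean())
    # Diversity: selected frames should be dissimilar to one another.
    iu = np.triu_indices(len(sel), k=1)
    diversity = float(1.0 - (sel @ sel.T)[iu].mean()) if iu[0].size else 1.0
    # Query-adaptability: selected frames should be relevant to the query.
    query_adaptability = float((sel @ q).mean())
    # Temporal coherence: prefer selections without large temporal gaps.
    gaps = np.diff(np.sort(np.asarray(selected_idx)))
    temporal_coherence = float(1.0 / (1.0 + gaps.mean())) if gaps.size else 1.0

    w1, w2, w3, w4 = weights
    return (w1 * representativeness + w2 * diversity
            + w3 * query_adaptability + w4 * temporal_coherence)

# Toy usage with random features standing in for real frame/query features.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 128))
query = feats[:10].mean(axis=0)
print(summary_reward(feats, np.array([5, 40, 80, 120]), query))
```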