An Empirical Study of Frame Selection for Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2311.00298v1
- Date: Wed, 1 Nov 2023 05:03:48 GMT
- Title: An Empirical Study of Frame Selection for Text-to-Video Retrieval
- Authors: Mengxia Wu, Min Cao, Yang Bai, Ziyin Zeng, Chen Chen, Liqiang Nie, Min Zhang
- Abstract summary: Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text.
Existing methods typically select a subset of frames within a video to represent the video content for TVR.
In this paper, we present the first empirical study of frame selection for TVR.
- Score: 62.28080029331507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video retrieval (TVR) aims to find the most relevant video in a large
video gallery given a query text. The intricate and abundant context of the
video challenges the performance and efficiency of TVR. To handle the
serialized video contexts, existing methods typically select a subset of frames
within a video to represent the video content for TVR. How to select the most
representative frames is a crucial issue: the selected frames must not only
retain the semantic information of the video but also promote retrieval
efficiency by excluding temporally redundant frames. In this paper, we present
the first empirical study of frame selection for TVR. We systematically
classify existing frame selection methods into text-free and text-guided ones,
and under this taxonomy analyze six different frame selection methods in terms
of effectiveness and efficiency. Two of these methods are newly developed in
this paper. Based on a comprehensive analysis across multiple TVR benchmarks,
we empirically conclude that with proper frame selection, TVR can achieve
significantly better retrieval efficiency without sacrificing retrieval
performance.
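To make the two families concrete, here is a minimal sketch of both, assuming precomputed CLIP-style frame and text embeddings. The function names and the cosine-similarity scoring are illustrative choices, not the paper's exact methods:

```python
import numpy as np

def uniform_select(num_frames: int, k: int) -> np.ndarray:
    """Text-free selection: k frame indices evenly spaced over the video."""
    return np.linspace(0, num_frames - 1, k).round().astype(int)

def text_guided_select(frame_embs: np.ndarray, text_emb: np.ndarray, k: int) -> np.ndarray:
    """Text-guided selection: keep the k frames whose embeddings are most
    similar (cosine) to the query-text embedding."""
    F = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    scores = F @ t                      # one relevance score per frame
    return np.argsort(-scores)[:k]      # indices of the top-k frames

# Toy usage: 64 frames with 512-d embeddings, keep 8
rng = np.random.default_rng(0)
frames = rng.normal(size=(64, 512))
query = rng.normal(size=512)
print(uniform_select(64, 8))            # [ 0  9 18 27 36 45 54 63]
print(text_guided_select(frames, query, 8))
```

The efficiency trade-off noted in the abstract is visible here: text-free selection can be precomputed once per video offline, while text-guided selection depends on the query and must be evaluated at retrieval time.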
Related papers
- An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval [1.6581184950812533]
We investigate the trade-offs in frame sampling methods for Video & Frame Retrieval using natural language questions.
Our study focuses on the storage and retrieval of image data (video frames) within a vector database, as required by the Video RAG pattern.
arXiv Detail & Related papers (2024-07-22T11:44:08Z)
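The Video RAG pattern above amounts to indexing sampled-frame embeddings in a vector store and querying it with an embedded question. A minimal sketch using FAISS as the vector index (our choice for illustration; the paper does not prescribe a particular database):

```python
import numpy as np
import faiss  # stand-in vector index; any vector database plays the same role

d = 512                                    # embedding dimension
rng = np.random.default_rng(0)

# Embeddings of sampled video frames (e.g., from a CLIP-style image encoder)
frame_embs = rng.normal(size=(1000, d)).astype(np.float32)
frame_embs /= np.linalg.norm(frame_embs, axis=1, keepdims=True)

index = faiss.IndexFlatIP(d)               # inner product == cosine on unit vectors
index.add(frame_embs)

# A natural-language question, embedded by the matching text encoder
query = rng.normal(size=(1, d)).astype(np.float32)
query /= np.linalg.norm(query)

scores, ids = index.search(query, 5)       # top-5 most relevant stored frames
print(ids[0], scores[0])
```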
- End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling [43.024232182899354]
We propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA.
We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question on the video.
The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods.
arXiv Detail & Related papers (2024-07-21T04:09:37Z)
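VidF4's three scoring mechanisms are specific to that paper; as a rough illustration of the general idea, the toy scorer below rewards question relevance and penalizes inter-frame redundancy:

```python
import numpy as np

def score_frames(frame_embs, question_emb, alpha=0.5):
    """Toy frame scorer in the spirit of VidF4: reward question relevance,
    penalize redundancy with other frames. Not the paper's exact mechanism."""
    F = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = question_emb / np.linalg.norm(question_emb)
    relevance = F @ q                      # how well each frame matches the question
    sim = F @ F.T                          # pairwise frame similarity
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.mean(axis=1)          # frames similar to many others score high
    return relevance - alpha * redundancy  # trade off the two criteria

rng = np.random.default_rng(1)
scores = score_frames(rng.normal(size=(32, 256)), rng.normal(size=256))
print(np.argsort(-scores)[:4])             # indices of the 4 highest-scoring frames
```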
- RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter [77.0205013713008]
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries.
To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision models.
We propose a sparse-and-correlated AdaPter (RAP) to fine-tune the pre-trained model with a few parameterized layers.
arXiv Detail & Related papers (2024-05-29T19:23:53Z)
- Fine-grained Text-Video Retrieval with Frozen Image Encoders [10.757101644990273]
We propose CrossTVR, a two-stage text-video retrieval architecture.
In the first stage, we leverage existing TVR methods with a cosine similarity network for efficient text/video candidate selection.
In the second stage, we propose a novel decoupled video-text cross-attention module to capture fine-grained multimodal information in the spatial and temporal dimensions.
arXiv Detail & Related papers (2023-07-14T02:57:00Z)
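CrossTVR's coarse-to-fine design generalizes to any cheap scorer plus expensive re-ranker. A hedged sketch of that pipeline, with the cross-attention re-ranker stubbed out as an arbitrary callable:

```python
import numpy as np

def two_stage_retrieve(text_emb, video_embs, rerank_fn, num_candidates=20):
    """Coarse-to-fine retrieval in the spirit of CrossTVR: a cheap cosine
    similarity pass shortlists candidates, then an expensive re-ranker
    (e.g., a cross-attention model) orders only that shortlist."""
    t = text_emb / np.linalg.norm(text_emb)
    V = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    coarse = V @ t                                     # stage 1: cosine similarity
    shortlist = np.argsort(-coarse)[:num_candidates]
    fine = np.array([rerank_fn(text_emb, video_embs[i]) for i in shortlist])
    return shortlist[np.argsort(-fine)]                # stage 2: re-ranked order

# Stand-in re-ranker: any fine-grained scorer fits here
rng = np.random.default_rng(2)
videos = rng.normal(size=(500, 256))
query = rng.normal(size=256)
print(two_stage_retrieve(query, videos, lambda t, v: float(t @ v))[:5])
```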
- Search-Map-Search: A Frame Selection Paradigm for Action Recognition [21.395733318164393]
Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of search and supervised learning to select the best combination of frames from a video as one entity.
arXiv Detail & Related papers (2023-04-20T13:49:53Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
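A frame-level contrastive objective of this kind can be written as an InfoNCE loss over frames. The sketch below is a simplification under our own assumptions (which frames count as positives is given as `pos_idx`), not FineCo's exact formulation:

```python
import numpy as np

def frame_text_infonce(frame_embs, text_emb, pos_idx, tau=0.07):
    """Toy frame-level contrastive loss: frames aligned with the text (pos_idx)
    are positives, the rest negatives (InfoNCE over frames)."""
    F = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    logits = (F @ t) / tau                              # temperature-scaled similarities
    logits -= logits.max()                              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())   # log-softmax over frames
    return -log_probs[pos_idx].mean()                   # push probability mass onto positives

rng = np.random.default_rng(3)
loss = frame_text_infonce(rng.normal(size=(16, 128)), rng.normal(size=128), pos_idx=[2, 7])
print(loss)
```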
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- SMART Frame Selection for Action Recognition [43.796505626453836]
We show that selecting good frames improves action recognition performance even in the trimmed-videos domain.
We propose a method that, instead of selecting frames one at a time, considers them jointly.
arXiv Detail & Related papers (2020-12-19T12:24:00Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
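TCA's long-range aggregation is transformer-based; a toy single-head self-attention pool over frame features, in the same spirit but not the paper's architecture, looks like this:

```python
import numpy as np

def temporal_attention_pool(frame_feats):
    """Toy long-range temporal aggregation: single-head self-attention over
    frame features, mean-pooled into one video-level vector."""
    d = frame_feats.shape[1]
    attn = frame_feats @ frame_feats.T / np.sqrt(d)      # frame-to-frame affinities
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # softmax over frames
    contextual = attn @ frame_feats                      # each frame attends to all others
    return contextual.mean(axis=0)                       # video-level descriptor

rng = np.random.default_rng(4)
video_vec = temporal_attention_pool(rng.normal(size=(30, 128)))
print(video_vec.shape)   # (128,)
```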