Search-Map-Search: A Frame Selection Paradigm for Action Recognition
- URL: http://arxiv.org/abs/2304.10316v1
- Date: Thu, 20 Apr 2023 13:49:53 GMT
- Title: Search-Map-Search: A Frame Selection Paradigm for Action Recognition
- Authors: Mingjun Zhao, Yakun Yu, Xiaoli Wang, Lei Yang and Di Niu
- Abstract summary: Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity.
- Score: 21.395733318164393
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of deep learning in video understanding tasks, processing
every frame in a video is computationally expensive and often unnecessary in
real-time applications. Frame selection aims to extract the most informative
and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on
per-frame importance prediction, without considering interactions among
frames, or adopt reinforcement learning agents to find representative frames
in succession, which are costly to train and may suffer from stability
issues. To overcome the limitations of existing methods, we propose a
Search-Map-Search learning paradigm which combines the advantages of heuristic
search and supervised learning to select the best combination of frames from a
video as one entity. By combining search with learning, the proposed method can
better capture frame interactions while incurring a low inference overhead.
Specifically, we first propose a hierarchical search method conducted on each
training video to search for the optimal combination of frames with the lowest
error on the downstream task. A feature mapping function is then learned to map
the frames of a video to the representation of its target optimal frame
combination. During inference, another search is performed on an unseen video
to select a combination of frames whose feature representation is close to the
projected feature representation. Extensive experiments based on several action
recognition benchmarks demonstrate that our frame selection method effectively
improves the performance of action recognition models and significantly
outperforms a number of competitive baselines.
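To make the three stages concrete, below is a minimal, self-contained sketch of the paradigm, not the authors' implementation: mean-pooled frame features stand in for the representation of a frame combination, an exhaustive scan over k-frame combinations stands in for the paper's hierarchical search, distance to a synthetic target vector stands in for the downstream task error, and linear least squares stands in for the learned feature mapping function. All names, shapes, and data are illustrative assumptions.

```python
# Toy Search-Map-Search sketch (assumptions: mean-pooled combination
# representations, exhaustive search, linear mapping, synthetic features).
import itertools
import numpy as np

rng = np.random.default_rng(0)
N_FRAMES, K, DIM = 12, 4, 16  # frames per video, frames to select, feature dim

def combo_repr(feats, combo):
    # Representation of a frame combination: mean of its frame features.
    return feats[list(combo)].mean(axis=0)

def search_stage(feats, score_fn):
    # Stand-in for the hierarchical search: score every K-frame
    # combination and return the best one (the paper searches more cheaply).
    combos = itertools.combinations(range(len(feats)), K)
    return min(combos, key=lambda c: score_fn(combo_repr(feats, c)))

# --- Search: find each training video's optimal frame combination -------
train_videos = [rng.normal(size=(N_FRAMES, DIM)) for _ in range(50)]
task_target = rng.normal(size=DIM)  # proxy for "lowest downstream error"
video_reprs, optimal_reprs = [], []
for feats in train_videos:
    best = search_stage(feats, lambda r: np.linalg.norm(r - task_target))
    optimal_reprs.append(combo_repr(feats, best))
    video_reprs.append(feats.mean(axis=0))  # whole-video input feature

# --- Map: fit a function from whole-video features to the representations
# of their optimal combinations (least squares as a stand-in for the
# paper's learned feature mapping network) --------------------------------
X, Y = np.stack(video_reprs), np.stack(optimal_reprs)
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# --- Search again: on an unseen video, select the combination whose
# representation is closest to the projected one --------------------------
unseen = rng.normal(size=(N_FRAMES, DIM))
projected = unseen.mean(axis=0) @ W
selected = search_stage(unseen, lambda r: np.linalg.norm(r - projected))
print("selected frame indices:", selected)
```

Even in this toy form the division of labor is visible: the expensive supervision (search against the task error) happens only at training time, while inference needs only the cheap learned projection followed by a feature-space search.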
Related papers
- An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval [1.6581184950812533]
We investigate the trade-offs in frame sampling methods for Video & Frame Retrieval using natural language questions.
Our study focuses on the storage and retrieval of image data (video frames) within a vector database, as required by the Video RAG pattern.
arXiv Detail & Related papers (2024-07-22T11:44:08Z) - End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling [43.024232182899354]
We propose VidF4, a novel VideoQA framework equipped with a tailored frame selection strategy for effective and efficient VideoQA.
We propose three frame-scoring mechanisms that consider both question relevance and inter-frame similarity to evaluate the importance of each frame for a given question about the video.
The experimental results across three widely adopted benchmarks demonstrate that our model consistently outperforms existing VideoQA methods.
arXiv Detail & Related papers (2024-07-21T04:09:37Z) - An Empirical Study of Frame Selection for Text-to-Video Retrieval [62.28080029331507]
Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text.
Existing methods typically select a subset of frames within a video to represent the video content for TVR.
In this paper, we make the first empirical study of frame selection for TVR.
arXiv Detail & Related papers (2023-11-01T05:03:48Z) - Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z) - Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z) - Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving temporal consistency without introducing extra overhead at inference time.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z) - OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z) - SMART Frame Selection for Action Recognition [43.796505626453836]
We show that selecting good frames helps in action recognition performance even in the trimmed videos domain.
We propose a method that, instead of selecting frames one at a time, considers them jointly.
arXiv Detail & Related papers (2020-12-19T12:24:00Z) - Temporal Context Aggregation for Video Retrieval with Contrastive
Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)