Online Learnable Keyframe Extraction in Videos and its Application with
Semantic Word Vector in Action Recognition
- URL: http://arxiv.org/abs/2009.12434v1
- Date: Fri, 25 Sep 2020 20:54:46 GMT
- Title: Online Learnable Keyframe Extraction in Videos and its Application with
Semantic Word Vector in Action Recognition
- Authors: G M Mashrur E Elahi, Yee-Hong Yang
- Abstract summary: We propose an online learnable module for keyframe extraction in videos.
This module can be used to select key-shots in video and thus can be applied to video summarization.
We also propose a plugin module to use the semantic word vector as input along with keyframes, as well as a novel train/test strategy for the classification models.
- Score: 5.849485167287474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video processing has become a popular research direction in computer vision
due to its various applications such as video summarization, action
recognition, etc. Recently, deep learning-based methods have achieved
impressive results in action recognition. However, these methods need to
process a full video sequence to recognize the action, even though most of
these frames are similar and non-essential to recognizing a particular action.
Additionally, these non-essential frames increase the computational cost and
can confuse a method in action recognition. In contrast, the important frames,
called keyframes, are not only helpful in recognizing an action but can also
reduce the processing time of each video sequence for classification or for
other applications, e.g., summarization. Moreover, current methods in video
processing have not yet been demonstrated in an online fashion.
Motivated by the above, we propose an online learnable module for keyframe
extraction. This module can be used to select key-shots in video and thus can
be applied to video summarization. The extracted keyframes can be used as input
to any deep learning-based classification model to recognize actions. We also
propose a plugin module to use the semantic word vector as input along with
keyframes and a novel train/test strategy for the classification models. To our
best knowledge, this is the first time such an online module and train/test
strategy have been proposed.
Experimental results on many commonly used datasets in video summarization and
in action recognition demonstrate the effectiveness of the proposed module.
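The abstract above does not specify the module's architecture or the train/test strategy, so the following is only a minimal sketch of the pipeline it describes: score frames with a small learnable module, keep the top-scoring ones as keyframes, and classify the action from those keyframes together with a semantic word vector. All class names, dimensions, the top-k selection rule, and the mean-pooling fusion below are illustrative assumptions rather than the authors' design.

```python
# Minimal sketch only; the paper's actual module, losses, and fusion strategy
# are not given in this abstract, so every design choice here is an assumption.
import torch
import torch.nn as nn

class KeyframeScorer(nn.Module):
    """Hypothetical learnable module that scores the importance of each frame."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, frame_feats):                 # frame_feats: (T, feat_dim)
        return self.score(frame_feats).squeeze(-1)  # (T,) importance per frame

def select_keyframes(frame_feats, scorer, k=16):
    """Keep the k highest-scoring frames; for online use this could be run per chunk."""
    scores = scorer(frame_feats)
    idx = torch.topk(scores, k=min(k, frame_feats.size(0))).indices
    return frame_feats[idx.sort().values]           # keep temporal order

class KeyframeActionClassifier(nn.Module):
    """Hypothetical classifier fusing pooled keyframe features with a word vector."""
    def __init__(self, feat_dim=512, word_dim=300, num_classes=101):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + word_dim, 256)
        self.head = nn.Linear(256, num_classes)

    def forward(self, keyframe_feats, word_vec):    # (k, feat_dim), (word_dim,)
        pooled = keyframe_feats.mean(dim=0)
        fused = torch.relu(self.fuse(torch.cat([pooled, word_vec], dim=-1)))
        return self.head(fused)                     # (num_classes,) action logits
```

In practice the frame features would come from a pretrained backbone and the word vector from a pretrained embedding (e.g., word2vec or GloVe) of the class name or video description; the abstract only states that the word vector is supplied alongside the keyframes.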
Related papers
- Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition [84.31749632725929]
In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method.
Our key idea is to distinguish video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains.
arXiv Detail & Related papers (2024-03-03T16:48:16Z)
- Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for keyframe detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the auto-encoder and then clusters these features with the k-means algorithm to group similar frames together (a minimal sketch of this procedure appears after this list).
The method was evaluated on the TVSum video dataset and achieved a classification accuracy of 0.77, a higher success rate than many existing methods.
arXiv Detail & Related papers (2023-06-21T15:09:37Z)
- Search-Map-Search: A Frame Selection Paradigm for Action Recognition [21.395733318164393]
Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of search and supervised learning to select the best combination of frames from a video as one entity.
arXiv Detail & Related papers (2023-04-20T13:49:53Z)
- Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance and outperforms existing methods by a margin of 31.10% in ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- An Integrated Approach for Video Captioning and Applications [2.064612766965483]
We design hybrid deep learning architectures to caption long videos.
We argue that linking images, videos, and natural language offers many practical benefits and immediate practical applications.
arXiv Detail & Related papers (2022-01-23T01:06:00Z)
- Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts [89.06560404218028]
We introduce a new method for pre-training video action recognition models using queried web videos.
Instead of trying to filter out the noise, we propose to convert the potential noise in these queried videos into useful supervision signals.
We show that SPL outperforms several existing pre-training strategies using pseudo-labels.
arXiv Detail & Related papers (2021-01-11T05:50:16Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a method for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
- Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos [82.02074241700728]
In this paper, we present an action recognition model that is trained with only video-level labels.
Our method leverages per-person detectors that have been trained on large image datasets within a Multiple Instance Learning framework.
We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid.
arXiv Detail & Related papers (2020-07-21T10:45:05Z)
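To make the encoder-plus-k-means procedure in the "Key Frame Extraction with Attention Based Deep Neural Networks" entry above concrete, here is a minimal clustering-based keyframe selection sketch: per-frame feature vectors (random placeholders standing in for the encoder output) are grouped with k-means, and the frame closest to each cluster centre is kept as a keyframe. The feature source, the number of clusters, and the nearest-to-centre rule are assumptions, not that paper's implementation.

```python
# Minimal sketch of clustering-based keyframe selection; assumes per-frame
# feature vectors from some encoder (random placeholders below), not the
# authors' actual model or code.
import numpy as np
from sklearn.cluster import KMeans

def extract_keyframes(frame_features, num_keyframes=10):
    """Cluster frame features and return one representative frame index per cluster."""
    kmeans = KMeans(n_clusters=num_keyframes, n_init=10, random_state=0)
    labels = kmeans.fit_predict(frame_features)      # (T,) cluster id for each frame
    keyframe_ids = []
    for c in range(num_keyframes):
        members = np.where(labels == c)[0]           # frames assigned to cluster c
        dists = np.linalg.norm(
            frame_features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframe_ids.append(int(members[np.argmin(dists)]))
    return sorted(keyframe_ids)

# Example with stand-in features for a 200-frame video:
features = np.random.rand(200, 512).astype(np.float32)
print(extract_keyframes(features, num_keyframes=5))
```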