Key Frame Extraction with Attention Based Deep Neural Networks
- URL: http://arxiv.org/abs/2306.13176v1
- Date: Wed, 21 Jun 2023 15:09:37 GMT
- Title: Key Frame Extraction with Attention Based Deep Neural Networks
- Authors: Samed Arslan, Senem Tanberk
- Abstract summary: We propose a deep learning-based approach for keyframe detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the autoencoder, then groups similar frames together by clustering these features with the k-means algorithm.
The method was evaluated on the TVSUM video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic keyframe detection from videos is the task of selecting the scenes that best summarize the content of a long video. Providing such a summary is important for quick browsing and content review. The resulting frames are used in automated workflows across industries (e.g., summarizing security footage, detecting the different scenes used in music videos). In addition, processing high-volume video with advanced machine learning methods incurs substantial resource costs; the extracted keyframes can instead serve as compact input features for the methods and models applied downstream. In this study, we propose a deep learning-based approach for keyframe detection using a deep auto-encoder model with an attention layer. The proposed method first extracts features from the video frames using the encoder part of the autoencoder and clusters these features with the k-means algorithm to group similar frames together. Keyframes are then selected from each cluster by choosing the frames closest to the cluster centers. The method was evaluated on the TVSUM video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods. The proposed method offers a promising solution for key frame extraction in video analysis and can be applied to applications such as video summarization and video retrieval.
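As an illustration of the pipeline described in the abstract, below is a minimal sketch of the clustering-and-selection step, assuming the frame embeddings have already been produced by the encoder half of the autoencoder; the function name and parameters are illustrative, not taken from the paper.

```python
# Hedged sketch: cluster encoder embeddings with k-means and keep the frame
# nearest each cluster centre as a keyframe. All names are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_features: np.ndarray, n_keyframes: int) -> list[int]:
    """frame_features: (n_frames, feat_dim) embeddings from the encoder.

    Returns the indices of the frames closest to each k-means cluster centre.
    """
    kmeans = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0)
    labels = kmeans.fit_predict(frame_features)
    keyframes = []
    for c in range(n_keyframes):
        members = np.flatnonzero(labels == c)          # frames in cluster c
        dists = np.linalg.norm(                        # distance to the centre
            frame_features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```

Picking the frame nearest each centroid, rather than the centroid itself, guarantees that every keyframe is an actual frame from the video.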
Related papers
- Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames [24.614476456145255]
We propose scene summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
arXiv Detail & Related papers (2023-11-28T22:18:26Z)
- Search-Map-Search: A Frame Selection Paradigm for Action Recognition [21.395733318164393]
Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of search and supervised learning to select the best combination of frames from a video as one entity.
arXiv Detail & Related papers (2023-04-20T13:49:53Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically (see the density-peaks sketch after this list).
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - VRAG: Region Attention Graphs for Content-Based Video Retrieval [85.54923500208041]
Region Attention Graph Networks (VRAG) improve upon state-of-the-art video-level methods.
VRAG represents videos at a finer granularity via region-level features and encodes video-temporal dynamics through region-level relations.
We show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval.
arXiv Detail & Related papers (2022-05-18T16:50:45Z)
- Semi-supervised and Deep learning Frameworks for Video Classification and Key-frame Identification [1.2335698325757494]
We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
arXiv Detail & Related papers (2022-03-25T05:45:18Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Classifying Video based on Automatic Content Detection Overview [12.556159953684023]
We summarized some state-of-the-art methods for multi-label video classification.
Our goal is first to experimentally evaluate the currently widely used architectures, and then to develop a method for handling the sequential data of frames.
arXiv Detail & Related papers (2021-03-29T04:31:45Z)
- Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition [5.849485167287474]
We propose an online learnable module for extraction of key-shots in video.
This module can be used to select key-shots in video and thus can be applied to video summarization.
We also propose a plugin module that takes the semantic word vector as an additional input, together with a novel train/test strategy for the classification models.
arXiv Detail & Related papers (2020-09-25T20:54:46Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
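As referenced in the TSDPC entry above, the density-peaks idea lets the number of key frames emerge from the data instead of being fixed in advance: frames that are both locally dense and far from any denser frame are taken as cluster centres. The sketch below is a generic Rodriguez-Laio style illustration under assumed cutoff and threshold values, not the TSDPC implementation.

```python
# Hedged sketch of density-peaks clustering for counting key frame centres
# automatically; the cutoff dc and the selection threshold are assumptions.
import numpy as np

def density_peak_centers(features: np.ndarray, dc: float = 0.5) -> list[int]:
    """Return indices of frames that stand out as density-peak centres."""
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    rho = (dist < dc).sum(axis=1) - 1            # local density (cut-off kernel)
    delta = np.empty(len(features))
    for i in range(len(features)):
        higher = np.flatnonzero(rho > rho[i])    # frames with higher density
        delta[i] = dist[i, higher].min() if higher.size else dist[i].max()
    gamma = rho * delta                          # centre score: dense AND isolated
    return np.flatnonzero(gamma > gamma.mean() + 2 * gamma.std()).tolist()
```

The number of frames whose score crosses the threshold determines the number of key frames, which is what allows such methods to avoid fixing k in advance.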
This list is automatically generated from the titles and abstracts of the papers in this site.