Key Frame Extraction with Attention Based Deep Neural Networks
- URL: http://arxiv.org/abs/2306.13176v1
- Date: Wed, 21 Jun 2023 15:09:37 GMT
- Title: Key Frame Extraction with Attention Based Deep Neural Networks
- Authors: Samed Arslan, Senem Tanberk
- Abstract summary: We propose a deep learning-based approach for keyframe detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the autoencoder, then groups similar frames together by clustering these features with the k-means algorithm.
The method was evaluated on the TVSUM video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic keyframe detection from videos is the task of selecting the scenes that best summarize the content of a long video. Providing such a summary is important for quick browsing and content review. The resulting frames are used in automated workflows across industries (e.g., summarizing security footage, detecting the different scenes used in music videos). In addition, processing high-volume video with advanced machine learning methods incurs substantial resource costs; the extracted keyframes can instead serve as compact input features for the methods and models applied downstream. In this study, we propose a deep learning-based approach for keyframe detection using a deep auto-encoder model with an attention layer. The proposed method first extracts features from the video frames using the encoder part of the autoencoder and clusters these features with the k-means algorithm to group similar frames together. Keyframes are then selected from each cluster by choosing the frames closest to the cluster centers. The method was evaluated on the TVSUM video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods. The proposed method offers a promising solution for key frame extraction in video analysis and can be applied to applications such as video summarization and video retrieval.
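As an illustration of the pipeline described in the abstract, below is a minimal sketch of the clustering-and-selection step, assuming the frame embeddings have already been produced by the encoder half of the autoencoder; the function name and parameters are illustrative, not taken from the paper.

```python
# Hedged sketch: cluster encoder embeddings with k-means and keep the frame
# nearest each cluster centre as a keyframe. All names are illustrative.
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_features: np.ndarray, n_keyframes: int) -> list[int]:
    """frame_features: (n_frames, feat_dim) embeddings from the encoder.

    Returns the indices of the frames closest to each k-means cluster centre.
    """
    kmeans = KMeans(n_clusters=n_keyframes, n_init=10, random_state=0)
    labels = kmeans.fit_predict(frame_features)
    keyframes = []
    for c in range(n_keyframes):
        members = np.flatnonzero(labels == c)          # frames in cluster c
        dists = np.linalg.norm(                        # distance to the centre
            frame_features[members] - kmeans.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```

Picking the frame nearest each centroid, rather than the centroid itself, guarantees that every keyframe is an actual frame from the video.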
Related papers
- Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames [24.614476456145255]
We propose scene summarization as a new video-based scene understanding task.
It aims to summarize a long video walkthrough of a scene into a small set of frames that are spatially diverse in the scene.
Our solution is a two-stage self-supervised pipeline named SceneSum.
arXiv Detail & Related papers (2023-11-28T22:18:26Z)
- Search-Map-Search: A Frame Selection Paradigm for Action Recognition [21.395733318164393]
Frame selection aims to extract the most informative and representative frames to help a model better understand video content.
Existing frame selection methods either individually sample frames based on per-frame importance prediction, or adopt reinforcement learning agents to find representative frames in succession.
We propose a Search-Map-Search learning paradigm which combines the advantages of search and supervised learning to select the best combination of frames from a video as one entity.
arXiv Detail & Related papers (2023-04-20T13:49:53Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, combining a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically (see the density-peaks sketch after this list).
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z) - VRAG: Region Attention Graphs for Content-Based Video Retrieval [85.54923500208041]
Region Attention Graph Networks (VRAG) improve upon state-of-the-art video-level methods.
VRAG represents videos at a finer granularity via region-level features and encodes video-temporal dynamics through region-level relations.
We show that the performance gap between video-level and frame-level methods can be reduced by segmenting videos into shots and using shot embeddings for video retrieval.
arXiv Detail & Related papers (2022-05-18T16:50:45Z)
- Semi-supervised and Deep learning Frameworks for Video Classification and Key-frame Identification [1.2335698325757494]
We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
arXiv Detail & Related papers (2022-03-25T05:45:18Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Classifying Video based on Automatic Content Detection Overview [12.556159953684023]
We summarized some state-of-the-art methods for multi-label video classification.
Our goal is first to experimentally evaluate the currently widely used architectures, and then to develop a method for handling the sequential data of frames.
arXiv Detail & Related papers (2021-03-29T04:31:45Z)
- Online Learnable Keyframe Extraction in Videos and its Application with Semantic Word Vector in Action Recognition [5.849485167287474]
We propose an online learnable module for extraction of key-shots in video.
This module can be used to select key-shots in video and thus can be applied to video summarization.
We also propose a plugin module that takes the semantic word vector as an additional input, together with a novel train/test strategy for the classification models.
arXiv Detail & Related papers (2020-09-25T20:54:46Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
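As referenced in the TSDPC entry above, the density-peaks idea lets the number of key frames emerge from the data instead of being fixed in advance: frames that are both locally dense and far from any denser frame are taken as cluster centres. The sketch below is a generic Rodriguez-Laio style illustration under assumed cutoff and threshold values, not the TSDPC implementation.

```python
# Hedged sketch of density-peaks clustering for counting key frame centres
# automatically; the cutoff dc and the selection threshold are assumptions.
import numpy as np

def density_peak_centers(features: np.ndarray, dc: float = 0.5) -> list[int]:
    """Return indices of frames that stand out as density-peak centres."""
    dist = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    rho = (dist < dc).sum(axis=1) - 1            # local density (cut-off kernel)
    delta = np.empty(len(features))
    for i in range(len(features)):
        higher = np.flatnonzero(rho > rho[i])    # frames with higher density
        delta[i] = dist[i, higher].min() if higher.size else dist[i].max()
    gamma = rho * delta                          # centre score: dense AND isolated
    return np.flatnonzero(gamma > gamma.mean() + 2 * gamma.std()).tolist()
```

The number of frames whose score crosses the threshold determines the number of key frames, which is what allows such methods to avoid fixing k in advance.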
This list is automatically generated from the titles and abstracts of the papers in this site.