Semi-supervised and Deep learning Frameworks for Video Classification
and Key-frame Identification
- URL: http://arxiv.org/abs/2203.13459v1
- Date: Fri, 25 Mar 2022 05:45:18 GMT
- Title: Semi-supervised and Deep learning Frameworks for Video Classification
and Key-frame Identification
- Authors: Sohini Roychowdhury
- Abstract summary: We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
- Score: 1.2335698325757494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating video-based data and machine learning pipelines poses several
challenges, including metadata generation for efficient storage and retrieval,
and isolation of key-frames for scene understanding tasks. In this work, we
present two semi-supervised approaches that automate this process of manual
frame sifting in video streams by automatically classifying scenes for content
and filtering frames for fine-tuning scene understanding tasks. The first
rule-based method starts from a pre-trained object detector and assigns
scene type, uncertainty and lighting categories to each frame based on
probability distributions of foreground objects. Next, frames with the highest
uncertainty and structural dissimilarity are isolated as key-frames. The second
method relies on the SimCLR model for frame encoding, followed by
label-spreading from 20% of frame samples to label the remaining frames for
scene and lighting categories. Also, clustering the video frames in the encoded
feature space further isolates key-frames at cluster boundaries. The proposed
methods achieve 64-93% accuracy for automated scene categorization on outdoor
videos from the public-domain JAAD and KITTI datasets. Also, fewer than 10% of
all input frames are filtered out as key-frames, which can then be sent for
annotation and fine-tuning of machine vision algorithms. Thus, the proposed
framework can be scaled to additional video data streams for automated training
of perception-driven systems with minimal training images.
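A minimal sketch of the first, rule-based route follows: score each frame by detection uncertainty plus structural dissimilarity to its predecessor, then keep the top-scoring fraction as key-frame candidates. The detector output format, the entropy-based uncertainty measure, and the keep fraction are illustrative assumptions, not the paper's exact rules.

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_uncertainty(class_probs):
    """Mean entropy over foreground objects detected in one frame.

    class_probs: (num_objects, num_classes) softmax scores from a
    pre-trained object detector (assumed format, for illustration).
    """
    if len(class_probs) == 0:
        return 0.0
    entropy = -np.sum(class_probs * np.log(class_probs + 1e-12), axis=1)
    return float(entropy.mean())

def select_keyframes(frames, per_frame_probs, keep_fraction=0.1):
    """Rank frames by uncertainty plus SSIM-dissimilarity to the previous frame.

    frames: list of grayscale uint8 images (2-D numpy arrays).
    per_frame_probs: one (num_objects, num_classes) array per frame.
    Returns indices of the top keep_fraction frames, highest score first.
    """
    scores = []
    for i, frame in enumerate(frames):
        dissim = 0.0 if i == 0 else 1.0 - structural_similarity(frames[i - 1], frame)
        scores.append(frame_uncertainty(per_frame_probs[i]) + dissim)
    n_keep = max(1, int(keep_fraction * len(frames)))
    return np.argsort(scores)[::-1][:n_keep]
```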
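The second route maps onto off-the-shelf components. The sketch below assumes SimCLR-style frame embeddings are already computed; it uses scikit-learn's LabelSpreading to propagate scene/lighting labels from the roughly 20% labeled frames, and flags frames near k-means cluster boundaries (nearly equidistant from their two closest centroids) as key-frames. The number of clusters and the margin threshold are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelSpreading

def propagate_labels(embeddings, labels):
    """Spread scene/lighting labels from the labeled subset of frames.

    embeddings: (n_frames, d) SimCLR-style frame encodings (assumed given).
    labels: (n_frames,) ints, with -1 marking unlabeled frames.
    """
    model = LabelSpreading(kernel="knn", n_neighbors=10)
    model.fit(embeddings, labels)
    return model.transduction_  # predicted label for every frame

def boundary_keyframes(embeddings, n_clusters=5, margin=0.1):
    """Flag frames whose two nearest k-means centroids are nearly equidistant."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # Distance from every frame embedding to every cluster centroid.
    d = np.linalg.norm(
        embeddings[:, None, :] - km.cluster_centers_[None, :, :], axis=2
    )
    d.sort(axis=1)
    # A small relative gap between the two closest centroids means the
    # frame sits near a cluster boundary -- a key-frame candidate.
    gap = (d[:, 1] - d[:, 0]) / (d[:, 1] + 1e-12)
    return np.where(gap < margin)[0]
```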
Related papers
- Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for key frame detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the auto-encoder and then applies the k-means algorithm to group these features so that similar frames are clustered together (a selection rule of this kind is sketched after this list).
The method was evaluated on the TVSUM clustering video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
arXiv Detail & Related papers (2023-06-21T15:09:37Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Video-Data Pipelines for Machine Learning Applications [0.9594432031144714]
The proposed framework can be scaled to additional video-sequence data sets for ML versioned deployments.
We analyze the performance of the proposed video-data pipeline for versioned deployment and monitoring for object detection algorithms.
arXiv Detail & Related papers (2021-10-15T20:28:56Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the structured analysis of advertising video content.
Our solution achieved a score of 0.2470, measured jointly on localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- No frame left behind: Full Video Action Recognition [26.37329995193377]
We propose full video action recognition and consider all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in the clusters into a smaller number of representations.
arXiv Detail & Related papers (2021-03-29T07:44:28Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)
- HMM-guided frame querying for bandwidth-constrained video search [16.956238550063365]
We design an agent to search for frames of interest in video stored on a remote server, under bandwidth constraints.
Using a convolutional neural network to score individual frames and a hidden Markov model to propagate predictions across frames, our agent accurately identifies temporal regions of interest based on sparse, strategically sampled frames.
On a subset of the ImageNet-VID dataset, we demonstrate that using a hidden Markov model to interpolate between frame scores allows 98% of frame requests to be omitted without compromising frame-of-interest classification accuracy (a minimal sketch of this smoothing idea appears after this list).
arXiv Detail & Related papers (2019-12-31T19:54:35Z)
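For the first related entry above (auto-encoder features grouped by k-means), here is a minimal sketch of the representative-per-cluster selection rule it implies; the feature source and the number of clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_frames(features, n_clusters=10):
    """Pick one representative frame index per k-means cluster.

    features: (n_frames, d) encoder activations per frame (e.g. from the
    encoder half of an auto-encoder; assumed precomputed).
    Returns, per cluster, the index of the frame closest to the centroid.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)
```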
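And for the HMM-guided querying entry, a sketch of the core smoothing idea: treat sparse per-frame CNN scores as noisy emissions of a two-state Markov chain and run forward-backward to fill in posteriors for frames that were never requested. The transition matrix and the score-to-likelihood mapping are placeholder assumptions, not the paper's fitted values.

```python
import numpy as np

def smooth_frame_scores(obs, trans=np.array([[0.95, 0.05], [0.05, 0.95]])):
    """Forward-backward posteriors for a 2-state chain
    (state 0 = background, state 1 = frame of interest).

    obs: (n_frames,) CNN scores in [0, 1], with np.nan for frames that
    were never sent to the scorer (the bandwidth saving).
    trans: assumed state-transition matrix (placeholder values).
    Returns P(frame of interest | all observed scores) for every frame.
    """
    n = len(obs)
    # Emission likelihoods: an observed score s is read as (1 - s, s);
    # unobserved frames are uninformative (uniform likelihood).
    emit = np.ones((n, 2))
    seen = ~np.isnan(obs)
    emit[seen, 0] = 1.0 - obs[seen]
    emit[seen, 1] = obs[seen]

    # Forward pass (normalized at each step for numerical stability).
    fwd = np.zeros((n, 2))
    fwd[0] = 0.5 * emit[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, n):
        fwd[t] = emit[t] * (fwd[t - 1] @ trans)
        fwd[t] /= fwd[t].sum()

    # Backward pass.
    bwd = np.ones((n, 2))
    for t in range(n - 2, -1, -1):
        bwd[t] = trans @ (emit[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()

    post = fwd * bwd
    return post[:, 1] / post.sum(axis=1)
```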