Semi-supervised and Deep learning Frameworks for Video Classification
and Key-frame Identification
- URL: http://arxiv.org/abs/2203.13459v1
- Date: Fri, 25 Mar 2022 05:45:18 GMT
- Title: Semi-supervised and Deep learning Frameworks for Video Classification
and Key-frame Identification
- Authors: Sohini Roychowdhury
- Abstract summary: We present two semi-supervised approaches that automatically classify scenes for content and filter frames for scene understanding tasks.
The proposed framework can be scaled to additional video data streams for automated training of perception-driven systems.
- Score: 1.2335698325757494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating video-based data and machine learning pipelines poses several
challenges, including metadata generation for efficient storage and retrieval,
and isolation of key-frames for scene understanding tasks. In this work, we
present two semi-supervised approaches that automate this process of manual
frame sifting in video streams by automatically classifying scenes for content
and filtering frames for fine-tuning scene understanding tasks. The first
rule-based method starts from a pre-trained object detector and assigns
scene type, uncertainty and lighting categories to each frame based on
probability distributions of foreground objects. Next, frames with the highest
uncertainty and structural dissimilarity are isolated as key-frames. The second
method relies on the SimCLR model for frame encoding, followed by
label-spreading from 20% of frame samples to label the remaining frames for
scene and lighting categories. Also, clustering the video frames in the encoded
feature space further isolates key-frames at cluster boundaries. The proposed
methods achieve 64-93% accuracy for automated scene categorization on outdoor
videos from the public-domain JAAD and KITTI datasets. Also, fewer than 10% of
all input frames are filtered out as key-frames, which can then be sent for
annotation and fine-tuning of machine vision algorithms. Thus, the proposed
framework can be scaled to additional video data streams for automated training
of perception-driven systems with minimal training images.
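A minimal sketch of the first, rule-based route follows: score each frame by detection uncertainty plus structural dissimilarity to its predecessor, then keep the top-scoring fraction as key-frame candidates. The detector output format, the entropy-based uncertainty measure, and the keep fraction are illustrative assumptions, not the paper's exact rules.

```python
import numpy as np
from skimage.metrics import structural_similarity

def frame_uncertainty(class_probs):
    """Mean entropy over foreground objects detected in one frame.

    class_probs: (num_objects, num_classes) softmax scores from a
    pre-trained object detector (assumed format, for illustration).
    """
    if len(class_probs) == 0:
        return 0.0
    entropy = -np.sum(class_probs * np.log(class_probs + 1e-12), axis=1)
    return float(entropy.mean())

def select_keyframes(frames, per_frame_probs, keep_fraction=0.1):
    """Rank frames by uncertainty plus SSIM-dissimilarity to the previous frame.

    frames: list of grayscale uint8 images (2-D numpy arrays).
    per_frame_probs: one (num_objects, num_classes) array per frame.
    Returns indices of the top keep_fraction frames, highest score first.
    """
    scores = []
    for i, frame in enumerate(frames):
        dissim = 0.0 if i == 0 else 1.0 - structural_similarity(frames[i - 1], frame)
        scores.append(frame_uncertainty(per_frame_probs[i]) + dissim)
    n_keep = max(1, int(keep_fraction * len(frames)))
    return np.argsort(scores)[::-1][:n_keep]
```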
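The second route maps onto off-the-shelf components. The sketch below assumes SimCLR-style frame embeddings are already computed; it uses scikit-learn's LabelSpreading to propagate scene/lighting labels from the roughly 20% labeled frames, and flags frames near k-means cluster boundaries (nearly equidistant from their two closest centroids) as key-frames. The number of clusters and the margin threshold are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.semi_supervised import LabelSpreading

def propagate_labels(embeddings, labels):
    """Spread scene/lighting labels from the labeled subset of frames.

    embeddings: (n_frames, d) SimCLR-style frame encodings (assumed given).
    labels: (n_frames,) ints, with -1 marking unlabeled frames.
    """
    model = LabelSpreading(kernel="knn", n_neighbors=10)
    model.fit(embeddings, labels)
    return model.transduction_  # predicted label for every frame

def boundary_keyframes(embeddings, n_clusters=5, margin=0.1):
    """Flag frames whose two nearest k-means centroids are nearly equidistant."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # Distance from every frame embedding to every cluster centroid.
    d = np.linalg.norm(
        embeddings[:, None, :] - km.cluster_centers_[None, :, :], axis=2
    )
    d.sort(axis=1)
    # A small relative gap between the two closest centroids means the
    # frame sits near a cluster boundary -- a key-frame candidate.
    gap = (d[:, 1] - d[:, 0]) / (d[:, 1] + 1e-12)
    return np.where(gap < margin)[0]
```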
Related papers
- Key Frame Extraction with Attention Based Deep Neural Networks [0.0]
We propose a deep learning-based approach for key frame detection using a deep auto-encoder model with an attention layer.
The proposed method first extracts features from the video frames using the encoder part of the auto-encoder and then applies the k-means algorithm to group these features so that similar frames are clustered together (a selection rule of this kind is sketched after this list).
The method was evaluated on the TVSUM clustering video dataset and achieved a classification accuracy of 0.77, indicating a higher success rate than many existing methods.
arXiv Detail & Related papers (2023-06-21T15:09:37Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation.
Our framework is based on optimization with a pretrained differentiable video frame interpolation module, without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Video-Data Pipelines for Machine Learning Applications [0.9594432031144714]
The proposed framework can be scaled to additional video-sequence data sets for ML versioned deployments.
We analyze the performance of the proposed video-data pipeline for versioned deployment and monitoring for object detection algorithms.
arXiv Detail & Related papers (2021-10-15T20:28:56Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the structured analysis of advertising video content.
Our solution achieved a score of 0.2470, measured jointly on localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- No frame left behind: Full Video Action Recognition [26.37329995193377]
We propose full video action recognition and consider all video frames.
We first cluster all frame activations along the temporal dimension.
We then temporally aggregate the frames in the clusters into a smaller number of representations.
arXiv Detail & Related papers (2021-03-29T07:44:28Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)
- HMM-guided frame querying for bandwidth-constrained video search [16.956238550063365]
We design an agent to search for frames of interest in video stored on a remote server, under bandwidth constraints.
Using a convolutional neural network to score individual frames and a hidden Markov model to propagate predictions across frames, our agent accurately identifies temporal regions of interest based on sparse, strategically sampled frames.
On a subset of the ImageNet-VID dataset, we demonstrate that using a hidden Markov model to interpolate between frame scores allows 98% of frame requests to be omitted without compromising frame-of-interest classification accuracy (a minimal sketch of this smoothing idea appears after this list).
arXiv Detail & Related papers (2019-12-31T19:54:35Z)
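For the first related entry above (auto-encoder features grouped by k-means), here is a minimal sketch of the representative-per-cluster selection rule it implies; the feature source and the number of clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_frames(features, n_clusters=10):
    """Pick one representative frame index per k-means cluster.

    features: (n_frames, d) encoder activations per frame (e.g. from the
    encoder half of an auto-encoder; assumed precomputed).
    Returns, per cluster, the index of the frame closest to the centroid.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)
```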
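And for the HMM-guided querying entry, a sketch of the core smoothing idea: treat sparse per-frame CNN scores as noisy emissions of a two-state Markov chain and run forward-backward to fill in posteriors for frames that were never requested. The transition matrix and the score-to-likelihood mapping are placeholder assumptions, not the paper's fitted values.

```python
import numpy as np

def smooth_frame_scores(obs, trans=np.array([[0.95, 0.05], [0.05, 0.95]])):
    """Forward-backward posteriors for a 2-state chain
    (state 0 = background, state 1 = frame of interest).

    obs: (n_frames,) CNN scores in [0, 1], with np.nan for frames that
    were never sent to the scorer (the bandwidth saving).
    trans: assumed state-transition matrix (placeholder values).
    Returns P(frame of interest | all observed scores) for every frame.
    """
    n = len(obs)
    # Emission likelihoods: an observed score s is read as (1 - s, s);
    # unobserved frames are uninformative (uniform likelihood).
    emit = np.ones((n, 2))
    seen = ~np.isnan(obs)
    emit[seen, 0] = 1.0 - obs[seen]
    emit[seen, 1] = obs[seen]

    # Forward pass (normalized at each step for numerical stability).
    fwd = np.zeros((n, 2))
    fwd[0] = 0.5 * emit[0]
    fwd[0] /= fwd[0].sum()
    for t in range(1, n):
        fwd[t] = emit[t] * (fwd[t - 1] @ trans)
        fwd[t] /= fwd[t].sum()

    # Backward pass.
    bwd = np.ones((n, 2))
    for t in range(n - 2, -1, -1):
        bwd[t] = trans @ (emit[t + 1] * bwd[t + 1])
        bwd[t] /= bwd[t].sum()

    post = fwd * bwd
    return post[:, 1] / post.sum(axis=1)
```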