Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism
- URL: http://arxiv.org/abs/2301.07565v1
- Date: Wed, 18 Jan 2023 14:36:22 GMT
- Title: Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism
- Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
- Abstract summary: Gated-ViGAT is an efficient approach for video event recognition.
It uses bottom-up (object) information, a new frame sampling policy and a gating mechanism.
Gated-ViGAT achieves a large reduction in computational complexity compared to our previous approach (ViGAT).
- Score: 8.395400675921515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we propose Gated-ViGAT, an efficient approach for
video event recognition that utilizes bottom-up (object) information, a new
frame sampling policy, and a gating mechanism. Specifically, the frame
sampling policy uses weighted in-degrees (WiDs), derived from the adjacency
matrices of graph attention networks (GATs), together with a dissimilarity
measure to select the most salient and at the same time diverse frames
representing the event in the video. Additionally, the proposed gating
mechanism fetches the selected frames sequentially and exits early once a
sufficiently confident decision is reached. In this way, only a few frames are
processed by the computationally expensive branch of the network that is
responsible for bottom-up information extraction. The experimental evaluation
on two large, publicly available video datasets (MiniKinetics, ActivityNet)
demonstrates that Gated-ViGAT achieves a large reduction in computational
complexity compared to our previous approach (ViGAT), while maintaining its
excellent event recognition and explainability performance. The Gated-ViGAT
source code is made publicly available at https://github.com/bmezaris/Gated-ViGAT
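To make the two efficiency mechanisms concrete, here is a minimal Python sketch of the ideas described in the abstract. It is an illustration under stated assumptions, not code from the Gated-ViGAT repository: the salience/diversity trade-off weight `lam`, the cosine dissimilarity, the mean-pooled running prediction, and the `gate` callable are all hypothetical choices.

```python
# Hypothetical sketch of Gated-ViGAT's two efficiency ideas (illustrative only):
# (1) rank frames by salience (WiD) while enforcing diversity, and
# (2) process the ranked frames sequentially with an early exit.
import numpy as np

def wid_scores(adjacency: np.ndarray) -> np.ndarray:
    """Weighted in-degree (WiD) of each frame node, i.e. the column sums of
    the attention-weighted adjacency matrix of a GAT over frames."""
    return adjacency.sum(axis=0)

def select_frames(adjacency: np.ndarray, features: np.ndarray,
                  num_frames: int, lam: float = 0.5) -> list[int]:
    """Greedily pick frames that are salient (high WiD) yet diverse
    (dissimilar to already-chosen frames); `lam` balances the two terms.
    Assumes num_frames <= number of available frames."""
    wids = wid_scores(adjacency)
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    selected = [int(np.argmax(wids))]            # start with the most salient frame
    while len(selected) < num_frames:
        sim = feats @ feats[selected].T          # cosine similarity to chosen set
        dissim = 1.0 - sim.max(axis=1)           # distance to nearest chosen frame
        score = lam * wids + (1.0 - lam) * dissim
        score[selected] = -np.inf                # never re-pick a frame
        selected.append(int(np.argmax(score)))
    return selected

def gated_inference(frame_ids, expensive_branch, gate):
    """Fetch the ranked frames one by one through the costly bottom-up
    (object) branch and stop as soon as the gate is confident enough."""
    frame_logits = []
    logits = None
    for step, f in enumerate(frame_ids):
        frame_logits.append(expensive_branch(f))  # expensive per-frame pass
        logits = np.mean(frame_logits, axis=0)    # running aggregated prediction
        if gate(logits, step):                    # early exit on high confidence
            break
    return logits
```

In the actual method the WiDs come from the trained GAT heads and the gate is a learned module; a crude stand-in for `gate` would be thresholding the maximum softmax probability of the running prediction.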
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and diverse degradation in some frames.
Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational costs.
This study presents a simple yet potent feature selection and aggregation strategy, gaining significant accuracy at marginal computational expense.
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method for key frame retrieval that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
TSDPC is a generic and powerful framework; among its advantages over previous works is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory (LSTM) network is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot research topic, driven by the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select salient frames without awareness of class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
- ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network [8.395400675921515]
ViGAT is a pure-attention bottom-up approach to derive object and frame features.
A head network is proposed to process these features for the task of event recognition and explanation in video.
A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets.
arXiv Detail & Related papers (2022-07-20T14:12:05Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict frame-level importance scores from both local and global views of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- End-to-End Compressed Video Representation Learning for Generic Event Boundary Detection [31.31508043234419]
We propose a new end-to-end compressed video representation learning approach for generic event boundary detection.
We first use ConvNets to extract features of the I-frames in the GOPs (groups of pictures).
After that, a light-weight spatial-channel compressed encoder is designed to compute the feature representations of the P-frames.
A temporal contrastive module is proposed to determine the event boundaries of video sequences.
arXiv Detail & Related papers (2022-03-29T08:27:48Z)
- OCSampler: Compressing Videos to One Clip with Single-step Sampling [82.0417131211353]
We propose a framework named OCSampler to explore a compact yet effective video representation with one short clip.
Our basic motivation is that efficient video recognition lies in processing the whole sequence at once rather than picking up frames sequentially.
arXiv Detail & Related papers (2022-01-12T09:50:38Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) that captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph that represents frames as nodes and relations between arbitrary frame pairs as edges (a minimal sketch of this construction appears after the list).
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
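As noted in the AGNN entry above, a recurring ingredient across these related works (AGNN, ViGAT, TG-Net) is a graph built over video frames. Below is a minimal, hypothetical sketch of such a fully connected frame graph; the dot-product affinity and softmax row-normalization are illustrative assumptions, not the exact relation module of any of the cited papers.

```python
# Illustrative construction of a fully connected frame graph: frames are
# nodes, and each pair of frames is linked by a soft relation weight (edge).
import numpy as np

def fully_connected_frame_graph(frame_feats: np.ndarray) -> np.ndarray:
    """frame_feats: (N, D) per-frame embeddings -> (N, N) dense adjacency."""
    affinity = frame_feats @ frame_feats.T                    # pairwise relations
    affinity = np.exp(affinity - affinity.max(axis=1, keepdims=True))
    return affinity / affinity.sum(axis=1, keepdims=True)     # softmax per row
```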