ViGAT: Bottom-up event recognition and explanation in video using
factorized graph attention network
- URL: http://arxiv.org/abs/2207.09927v1
- Date: Wed, 20 Jul 2022 14:12:05 GMT
- Title: ViGAT: Bottom-up event recognition and explanation in video using
factorized graph attention network
- Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
- Abstract summary: ViGAT is a pure-attention bottom-up approach to derive object and frame features.
A head network is proposed to process these features for the task of event recognition and explanation in video.
A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets.
- Score: 8.395400675921515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper a pure-attention bottom-up approach, called ViGAT, that
utilizes an object detector together with a Vision Transformer (ViT) backbone
network to derive object and frame features, and a head network to process
these features for the task of event recognition and explanation in video, is
proposed. The ViGAT head consists of graph attention network (GAT) blocks
factorized along the spatial and temporal dimensions in order to capture
effectively both local and long-term dependencies between objects or frames.
Moreover, using the weighted in-degrees (WiDs) derived from the adjacency
matrices at the various GAT blocks, we show that the proposed architecture can
identify the most salient objects and frames that explain the decision of the
network. A comprehensive evaluation study is performed, demonstrating that the
proposed approach provides state-of-the-art results on three large, publicly
available video datasets (FCVID, Mini-Kinetics, ActivityNet).
Related papers
- Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using
a New Frame Selection Policy and Gating Mechanism [8.395400675921515]
Gated-ViGAT is an efficient approach for video event recognition.
It uses bottom-up (object) information, a new frame sampling policy and a gating mechanism.
Gated-ViGAT provides a large computational complexity reduction in comparison to our previous approach.
arXiv Detail & Related papers (2023-01-18T14:36:22Z) - Network Comparison Study of Deep Activation Feature Discriminability
with Novel Objects [0.5076419064097732]
State-of-the-art computer visions algorithms have incorporated Deep Neural Networks (DNN) in feature extracting roles, creating Deep Convolutional Activation Features (DeCAF)
This study analyzes the general discriminability of novel object visual appearances encoded into the DeCAF space of six of the leading visual recognition DNN architectures.
arXiv Detail & Related papers (2022-02-08T07:40:53Z) - Recent Trends in 2D Object Detection and Applications in Video Event
Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z) - Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
Full- Strategy Network (FSNet) is a novel framework for video object segmentation (VOS)
Our FSNet performs the crossmodal feature-passing (i.e., transmission and receiving) simultaneously before fusion decoding stage.
We show that our FSNet outperforms other state-of-the-arts for both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - PC-RGNN: Point Cloud Completion and Graph Neural Network for 3D Object
Detection [57.49788100647103]
LiDAR-based 3D object detection is an important task for autonomous driving.
Current approaches suffer from sparse and partial point clouds of distant and occluded objects.
In this paper, we propose a novel two-stage approach, namely PC-RGNN, dealing with such challenges by two specific solutions.
arXiv Detail & Related papers (2020-12-18T18:06:43Z) - Feature Flow: In-network Feature Flow Estimation for Video Object
Detection [56.80974623192569]
Optical flow is widely used in computer vision tasks to provide pixel-level motion information.
A common approach is to:forward optical flow to a neural network and fine-tune this network on the task dataset.
We propose a novel network (IFF-Net) with an textbfIn-network textbfFeature textbfFlow estimation module for video object detection.
arXiv Detail & Related papers (2020-09-21T07:55:50Z) - Visual Concept Reasoning Networks [93.99840807973546]
A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks.
We propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts.
Our proposed model, VCRNet, consistently improves the performance by increasing the number of parameters by less than 1%.
arXiv Detail & Related papers (2020-08-26T20:02:40Z) - Learning Discriminative Feature with CRF for Unsupervised Video Object
Segmentation [34.1031534327244]
We introduce discriminative feature network (DFNet) to address the unsupervised video object segmentation task.
DFNet outperforms state-of-the-art methods by a large margin with a mean IoU score of 83.4%.
DFNet is also applied to the image object co-segmentation task.
arXiv Detail & Related papers (2020-08-04T01:53:56Z) - Ventral-Dorsal Neural Networks: Object Detection via Selective Attention [51.79577908317031]
We propose a new framework called Ventral-Dorsal Networks (VDNets)
Inspired by the structure of the human visual system, we propose the integration of a "Ventral Network" and a "Dorsal Network"
Our experimental results reveal that the proposed method outperforms state-of-the-art object detection approaches.
arXiv Detail & Related papers (2020-05-15T23:57:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.