Related papers: Hypergraph-based Multi-View Action Recognition using Event Cameras

Hypergraph-based Multi-View Action Recognition using Event Cameras

URL: http://arxiv.org/abs/2403.19316v1
Date: Thu, 28 Mar 2024 11:17:00 GMT
Title: Hypergraph-based Multi-View Action Recognition using Event Cameras
Authors: Yue Gao, Jiaxuan Lu, Siqi Li, Yipeng Li, Shaoyi Du,
Abstract summary: We introduce HyperMV, a multi-view event-based action recognition framework. We present the largest multi-view event-based action dataset $textTHUtextMV-EACTtext-50$, comprising 50 actions from 6 viewpoints. Experimental results show that HyperMV significantly outperforms baselines in both cross-subject and cross-view scenarios.
Score: 20.965606424362726
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Action recognition from video data forms a cornerstone with wide-ranging applications. Single-view action recognition faces limitations due to its reliance on a single viewpoint. In contrast, multi-view approaches capture complementary information from various viewpoints for improved accuracy. Recently, event cameras have emerged as innovative bio-inspired sensors, leading to advancements in event-based action recognition. However, existing works predominantly focus on single-view scenarios, leaving a gap in multi-view event data exploitation, particularly in challenges like information deficit and semantic misalignment. To bridge this gap, we introduce HyperMV, a multi-view event-based action recognition framework. HyperMV converts discrete event data into frame-like representations and extracts view-related features using a shared convolutional network. By treating segments as vertices and constructing hyperedges using rule-based and KNN-based strategies, a multi-view hypergraph neural network that captures relationships across viewpoint and temporal features is established. The vertex attention hypergraph propagation is also introduced for enhanced feature fusion. To prompt research in this area, we present the largest multi-view event-based action dataset $\text{THU}^{\text{MV-EACT}}\text{-50}$, comprising 50 actions from 6 viewpoints, which surpasses existing datasets by over tenfold. Experimental results show that HyperMV significantly outperforms baselines in both cross-subject and cross-view scenarios, and also exceeds the state-of-the-arts in frame-based multi-view action recognition.

Related papers

X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification [79.37768038337971]
We propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID.<n> Specifically, we first propose a Cross-modality Prototype Collaboration (CPC)<n>Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment.
arXiv Detail & Related papers (2025-11-22T07:57:15Z)
MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation [5.0923114224599555]
This paper introduces a novel hierarchical graph neural network (GNN) based model MissionGNN. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models. Our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches.
arXiv Detail & Related papers (2024-06-27T01:09:07Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates textitmulti-view encoding, textitmulti-view matching, and textitmulti-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method. We embed multi-scale complementary features from the same view position into a set of nodes. By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
arXiv Detail & Related papers (2023-07-31T15:04:52Z)
Two-level Data Augmentation for Calibrated Multi-view Detection [51.5746691103591]
We introduce a new multi-view data augmentation pipeline that preserves alignment among views. We also propose a second level of augmentation applied directly at the scene level. When combined with our simple multi-view detection model, our two-level augmentation pipeline outperforms all existing baselines.
arXiv Detail & Related papers (2022-10-19T17:55:13Z)
Multimodal Graph Learning for Deepfake Detection [10.077496841634135]
Existing deepfake detectors face several challenges in achieving robustness and generalization. We propose a novel framework, namely Multimodal Graph Learning (MGL), that leverages information from multiple modalities. Our proposed method aims to effectively identify and utilize distinguishing features for deepfake detection.
arXiv Detail & Related papers (2022-09-12T17:17:49Z)
ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network [8.395400675921515]
ViGAT is a pure-attention bottom-up approach to derive object and frame features. A head network is proposed to process these features for the task of event recognition and explanation in video. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets.
arXiv Detail & Related papers (2022-07-20T14:12:05Z)
Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems [0.0]
Long-term motion patterns alone play a pivotal role in the task of recognizing an event. We show that the long-term motion patterns alone play a pivotal role in the task of recognizing an event. Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
arXiv Detail & Related papers (2021-11-03T08:30:38Z)
Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision. We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH) to better representational capabilities. 90.0% top-1 accuracy on MARS is achieved using MGH, outperforming the state-of-the-arts schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training. We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem. The proposed CAM detects the attention differences among multi-view, and adaptively integrates frame-level information to benefit each other. Experiments on four action datasets illustrate the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
Generative Partial Multi-View Clustering [133.36721417531734]
We propose a generative partial multi-view clustering model, named as GP-MVC, to address the incomplete multi-view problem. First, multi-view encoder networks are trained to learn common low-dimensional representations, followed by a clustering layer to capture the consistent cluster structure across multiple views. Second, view-specific generative adversarial networks are developed to generate the missing data of one view conditioning on the shared representation given by other views.
arXiv Detail & Related papers (2020-03-29T17:48:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.