SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
- URL: http://arxiv.org/abs/2601.17391v1
- Date: Sat, 24 Jan 2026 09:24:42 GMT
- Title: SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
- Authors: Rui Fan, Weidong Hao
- Abstract summary: Event-based action recognition (EAR) offers privacy-protecting and efficiency advantages, and temporal motion dynamics are of great importance to it. This paper reexamines the key SMVRL design stages for EAR and proposes a principled multi-view representation through translation-invariant dense conversion of sparse events. We show Top-1 accuracy gains over the existing SMVRL EOR method with 30.1% fewer parameters and 35.7% lower computation, establishing our framework as a novel and powerful EAR paradigm.
- Score: 4.322175390073132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event-based action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, and temporal motion dynamics are of great importance to it. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting H-W-T events along the spatial axes H and W, yet they are limited by their translation-variant spatial-binning representation and naive early-concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics the speed variability of real-world human actions. On three challenging EAR datasets, HARDVS, DailyDVS-200, and THU-EACT-50-CHL, we show +7.0%, +10.7%, and +10.2% Top-1 accuracy gains over the existing SMVRL EOR method, with 30.1% fewer parameters and 35.7% lower computation, establishing our framework as a novel and powerful EAR paradigm.
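The multi-view idea in the abstract, projecting H-W-T events along the spatial axes H and W, is easy to make concrete. The following minimal NumPy sketch bins a sparse (x, y, t, p) event stream into H-W, H-T, and W-T view images; the bin counts and polarity accumulation are illustrative assumptions, not the paper's translation-invariant dense conversion.

```python
import numpy as np

def multi_view_projection(events, H=128, W=128, T=16):
    """events: (N, 4) array of (x, y, t, p) rows; t in arbitrary units."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Normalize timestamps into T temporal bins.
    t_bin = np.clip(((t - t.min()) / max(np.ptp(t), 1e-9) * (T - 1)).astype(int), 0, T - 1)
    hw = np.zeros((H, W))  # standard spatial view
    ht = np.zeros((H, T))  # projection along W: row activity over time
    wt = np.zeros((W, T))  # projection along H: column activity over time
    np.add.at(hw, (y, x), p)
    np.add.at(ht, (y, t_bin), p)
    np.add.at(wt, (x, t_bin), p)
    return hw, ht, wt

# Usage: 1000 synthetic events on a 128x128 sensor.
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 128, 1000), rng.integers(0, 128, 1000),
               np.sort(rng.random(1000)), rng.choice([-1.0, 1.0], 1000)], axis=1)
hw, ht, wt = multi_view_projection(ev)
print(hw.shape, ht.shape, wt.shape)  # (128, 128) (128, 16) (128, 16)
```

The H-T and W-T views trade one spatial axis for time, which is what lets a 2D backbone see motion dynamics directly.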
Related papers
- HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating [0.571097144710995]
Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. Existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns.
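To make the "computational redundancy of dense voxel representations" concrete: a (T, H, W) voxel grid built from a typical event slice is mostly zeros. This sketch (with assumed sensor shape and event count, not HoloEv-Net's pipeline) voxelizes events and reports the occupancy.

```python
import numpy as np

def voxelize(events, H=260, W=346, T=10):
    """events: (N, 3) array of (x, y, t); returns a dense (T, H, W) count grid."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    t_bin = np.clip(((t - t.min()) / max(np.ptp(t), 1e-9) * (T - 1)).astype(int), 0, T - 1)
    grid = np.zeros((T, H, W), dtype=np.float32)
    np.add.at(grid, (t_bin, y, x), 1.0)
    return grid

rng = np.random.default_rng(0)
n = 20_000  # events in the slice (illustrative)
ev = np.stack([rng.integers(0, 346, n), rng.integers(0, 260, n),
               np.sort(rng.random(n))], axis=1)
grid = voxelize(ev)
occupancy = np.count_nonzero(grid) / grid.size
print(f"{occupancy:.1%} of voxels are non-zero")  # typically only a few percent
```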
arXiv Detail & Related papers (2026-02-04T03:42:42Z) - USV: Unified Sparsification for Accelerating Video Diffusion Models [11.011602744993942]
Unified Sparsification for Video diffusion models is an end-to-end trainable framework. It orchestrates sparsification across both the model's internal computation and its sampling process. It achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity.
arXiv Detail & Related papers (2025-12-05T14:40:06Z) - Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z) - Efficient Spatial-Temporal Modeling for Real-Time Video Analysis: A Unified Framework for Action Recognition and Object Tracking [0.0]
Real-time video analysis remains a challenging problem in computer vision. We present a unified framework that leverages advanced spatial-temporal modeling techniques for simultaneous action recognition and object tracking. Our method achieves state-of-the-art performance on standard benchmarks while maintaining real-time inference speeds.
arXiv Detail & Related papers (2025-07-30T06:49:11Z) - Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios [3.437245452211197]
This paper introduces a Multi-scale Spectral Attention Module (MSAM) that enhances spectral feature extraction. By integrating MSAM into UNet's skip connections (UNet-SC), our proposed UNet-MSAM achieves significant improvements in semantic segmentation performance.
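A minimal PyTorch sketch of what a multi-scale spectral attention module could look like: pool spatially, run parallel 1-D convolutions of different kernel sizes over the spectral (channel) axis, and reweight the bands. The kernel sizes and gating form here are assumptions, not MSAM's published design.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    def __init__(self, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Parallel 1-D convs over the spectral axis, one per scale.
        self.convs = nn.ModuleList(
            nn.Conv1d(1, 1, k, padding=k // 2, bias=False) for k in kernel_sizes
        )

    def forward(self, x):                      # x: (B, C, H, W), C = spectral bands
        s = x.mean(dim=(2, 3)).unsqueeze(1)    # (B, 1, C) global spatial pooling
        att = sum(conv(s) for conv in self.convs) / len(self.convs)
        att = torch.sigmoid(att).squeeze(1)    # (B, C) per-band weights
        return x * att[:, :, None, None]       # reweight each spectral band

x = torch.randn(2, 31, 64, 64)                 # e.g. a 31-band hyperspectral patch
print(SpectralAttention()(x).shape)            # torch.Size([2, 31, 64, 64])
```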
arXiv Detail & Related papers (2025-06-23T14:24:20Z) - CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations [52.251569042852815]
CRIA is an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods under the same pre-training conditions.
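Cross-attention fusion of the kind described above can be sketched in a few lines: one feature stream queries the others. The token counts, dimensions, and query/key assignment below are illustrative assumptions, not CRIA's architecture.

```python
import torch
import torch.nn as nn

d = 64
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

temporal = torch.randn(2, 32, d)   # (batch, tokens, dim) temporal features
spectral = torch.randn(2, 16, d)   # spectral features
spatial  = torch.randn(2, 19, d)   # spatial (channel-layout) features

# Temporal tokens attend over the concatenated spectral and spatial tokens,
# so each time step pulls in whatever complementary context it needs.
context = torch.cat([spectral, spatial], dim=1)
fused, _ = attn(query=temporal, key=context, value=context)
print(fused.shape)                 # torch.Size([2, 32, 64])
```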
arXiv Detail & Related papers (2025-06-19T06:31:08Z) - Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling [0.0]
Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. We introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module.
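The augmentation pipeline named above (temporal shift, spatial flip, event deletion) is easy to sketch on raw (x, y, t, p) events; the shift magnitude and drop rate below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def augment_events(ev, W=346, max_shift=0.05, drop=0.1, rng=np.random.default_rng()):
    """ev: (N, 4) array of (x, y, t, p) events; returns an augmented copy."""
    ev = ev.copy()
    # Temporal shift: jitter all timestamps by a fraction of the total duration.
    ev[:, 2] += rng.uniform(-max_shift, max_shift) * np.ptp(ev[:, 2])
    # Spatial flip: mirror x coordinates with probability 0.5.
    if rng.random() < 0.5:
        ev[:, 0] = (W - 1) - ev[:, 0]
    # Event deletion: randomly drop a fraction of events.
    keep = rng.random(len(ev)) > drop
    return ev[keep]

rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 346, 5000), rng.integers(0, 260, 5000),
               np.sort(rng.random(5000)), rng.choice([-1.0, 1.0], 5000)], axis=1)
print(len(augment_events(ev)), "events kept out of", len(ev))
```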
arXiv Detail & Related papers (2025-04-14T07:57:22Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech. The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture. To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial to enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes involving pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Recurrent Vision Transformers for Object Detection with Event Cameras [62.27246562304705]
We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras.
RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection.
Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
arXiv Detail & Related papers (2022-12-11T20:28:59Z) - HALSIE: Hybrid Approach to Learning Segmentation by Simultaneously Exploiting Image and Event Modalities [6.543272301133159]
Event cameras detect changes in per-pixel intensity to generate asynchronous event streams.
They offer great potential for accurate semantic map retrieval in real-time autonomous systems.
Existing implementations for event-based segmentation suffer from sub-optimal performance.
We propose HALSIE, a hybrid end-to-end learning framework that reduces inference cost by up to $20\times$ compared to the state of the art.
arXiv Detail & Related papers (2022-11-19T17:09:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.