Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
- URL: http://arxiv.org/abs/2303.03856v2
- Date: Thu, 18 May 2023 07:48:25 GMT
- Title: Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
- Authors: Bochen Xie and Yongjian Deng and Zhanpeng Shao and Hai Liu and Qingsong Xu and Youfu Li
- Abstract summary: Event cameras are neuromorphic vision sensors representing visual information as sparse and asynchronous event streams.
We develop a novel attention-aware model named Event Voxel Set Transformer (EVSTr) for representation learning on event streams.
We evaluate the proposed model on two event-based recognition tasks: object classification and action recognition.
- Score: 23.872611710730865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras are neuromorphic vision sensors representing visual information
as sparse and asynchronous event streams. Most state-of-the-art event-based
methods project events into dense frames and process them with conventional
learning models. However, these approaches sacrifice the sparsity and high
temporal resolution of event data, resulting in a large model size and high
computational complexity. To fit the sparse nature of events and sufficiently
explore the relationship between them, we develop a novel attention-aware model
named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation
learning on event streams. It first converts the event stream into voxel sets
and then hierarchically aggregates voxel features to obtain robust
representations. The core of EVSTr is an event voxel transformer encoder to
extract discriminative spatiotemporal features, which consists of two
well-designed components, including a Multi-Scale Neighbor Embedding Layer
(MNEL) for local information aggregation and a Voxel Self-Attention Layer
(VSAL) for global feature interactions. To enable the network to incorporate
long-range temporal structure, we introduce a segment modeling strategy that
learns motion patterns from a sequence of segmented voxel sets. We evaluate the
proposed model on two event-based recognition tasks: object classification and
action recognition. Comprehensive experiments show that EVSTr achieves
state-of-the-art performance while maintaining low model complexity.
Additionally, we present a new dataset (NeuroHAR), recorded in challenging
visual scenarios, to address the lack of real-world event-based datasets for
action recognition.
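The pipeline's first stage lends itself to a short illustration. Below is a minimal sketch, not the authors' code, of converting an event stream into a sparse voxel set and splitting the stream into segments for segment modeling; the grid size, the signed-count voxel feature, and the (x, y, t, p) event layout are illustrative assumptions, whereas the paper learns richer voxel features with MNEL and VSAL.

```python
import numpy as np

def events_to_voxel_set(events, sensor_hw, grid=(8, 8, 4)):
    """Quantize events into a sparse set of non-empty voxels.

    events: (N, 4) array of (x, y, timestamp, polarity in {-1, +1});
    assumes at least one event. Returns voxel coordinates (M, 3) and one
    feature per voxel; a signed event count stands in for learned features.
    """
    H, W = sensor_hw
    gx, gy, gt = grid
    x, y, t, p = events.T
    ix = np.clip((x / W * gx).astype(int), 0, gx - 1)
    iy = np.clip((y / H * gy).astype(int), 0, gy - 1)
    tn = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # normalize time to [0, 1]
    it = np.clip((tn * gt).astype(int), 0, gt - 1)
    # Keep only occupied voxels, preserving the sparsity of the stream.
    coords, inv = np.unique(np.stack([ix, iy, it], axis=1),
                            axis=0, return_inverse=True)
    feats = np.zeros(len(coords))
    np.add.at(feats, inv, p)            # signed event count per voxel
    return coords, feats

def segment_stream(events, num_segments=3):
    """Split a stream into equal-duration segments for segment modeling."""
    t = events[:, 2]
    edges = np.linspace(t.min(), t.max(), num_segments + 1)
    idx = np.clip(np.searchsorted(edges, t, side="right") - 1,
                  0, num_segments - 1)
    return [events[idx == s] for s in range(num_segments)]

# Usage: one voxel set per segment, which the encoder would then aggregate.
# voxel_sets = [events_to_voxel_set(seg, (180, 240)) for seg in segment_stream(ev)]
```

Keeping only the non-empty voxels is what distinguishes this set representation from dense frame projection: memory and compute scale with the number of occupied cells rather than the full grid.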
Related papers
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, exceeding the second-best result by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras, or dynamic vision sensors, are capable of capturing per-pixel brightness changes (called event streams) with high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions that take event streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z)
- Implicit Event-RGBD Neural SLAM [54.74363487009845]
Implicit neural SLAM has achieved remarkable progress recently.
Existing methods face significant challenges in non-ideal scenarios.
We propose EN-SLAM, the first event-RGBD implicit neural SLAM framework.
arXiv Detail & Related papers (2023-11-18T08:48:58Z)
- GET: Group Event Transformer for Event-Based Vision [82.312736707534]
Event cameras are a novel type of neuromorphic sensor that has been gaining increasing attention.
We propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET).
GET decouples temporal-polarity information from spatial information throughout the feature extraction process.
arXiv Detail & Related papers (2023-10-04T08:02:33Z)
- Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification [6.550582412924754]
This paper proposes a novel dual-stream framework for event representation, extraction, and fusion.
Experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets.
arXiv Detail & Related papers (2023-08-23T06:07:56Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into x-y-t grids for both positive and negative polarities, producing a set of pillars as a 3D tensor representation (see the sketch below).
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatiotemporal correlation between event pillars.
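As a hedged illustration of that pillar representation (assumptions, not the paper's code): events are binned into per-polarity x-y-t grids, giving a dense tensor whose (polarity, y, x) columns are the pillars. The number of time bins and the {0, 1} polarity encoding are assumptions here.

```python
import numpy as np

def events_to_pillars(events, sensor_hw, t_bins=5):
    """events: (N, 4) array of (x, y, timestamp, polarity in {0, 1})."""
    H, W = sensor_hw
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3].astype(int)        # 0 = negative, 1 = positive polarity
    tn = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    it = np.clip((tn * t_bins).astype(int), 0, t_bins - 1)
    vol = np.zeros((2, t_bins, H, W), dtype=np.float32)
    np.add.at(vol, (p, it, y, x), 1.0)  # event count per (polarity, t, y, x) cell
    return vol                          # each (p, :, y, x) column is one pillar
```

A convLSTM would then carry long memory across such tensors from successive time windows, matching the long/short memory split described above.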
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatiotemporal kernels that adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we mine the interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- EV-VGCNN: A Voxel Graph CNN for Event-based Object Classification [18.154951807178943]
Event cameras report sparse intensity changes and hold notable advantages of low power consumption, high dynamic range, and high response speed for visual perception and understanding on portable devices.
Event-based learning methods have recently achieved remarkable success in object recognition by integrating events into dense frame-based representations to apply traditional 2D learning algorithms.
However, these approaches introduce much redundant information during the sparse-to-dense conversion and necessitate heavyweight, large-capacity models, limiting the potential of event cameras in real-life applications.
arXiv Detail & Related papers (2021-06-01T04:07:03Z)
- Superevents: Towards Native Semantic Segmentation for Event-based Cameras [13.099264910430986]
Most successful computer vision models transform low-level features, such as Gabor filter responses, into richer representations of intermediate or mid-level complexity for downstream visual tasks.
We present a novel method that employs lifetime augmentation for obtaining an event stream representation that is fed to a fully convolutional network to extract superevents.
arXiv Detail & Related papers (2021-05-13T05:49:41Z)
- A Differentiable Recurrent Surface for Asynchronous Event-Based Data [19.605628378366667]
We propose Matrix-LSTM, a grid of Long Short-Term Memory (LSTM) cells that efficiently process events and learn end-to-end task-dependent event-surfaces.
Compared to existing reconstruction approaches, our learned event surface shows good flexibility on optical flow estimation.
It improves the state-of-the-art of event-based object classification on the N-Cars dataset.
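As a rough illustration of the idea (assumptions throughout, not the released Matrix-LSTM), the sketch below runs one weight-shared LSTM over each pixel's event sequence and writes the final hidden state into that pixel's cell of the surface; the per-event features (normalized timestamp and polarity) are assumed, and the per-pixel loop is kept for clarity where the real model vectorizes.

```python
import torch
import torch.nn as nn

class PerPixelLSTMSurface(nn.Module):
    """One weight-shared LSTM per pixel; its last hidden state is the surface value."""

    def __init__(self, feat_dim=2, hidden=1):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, events, sensor_hw):
        """events: (N, 4) float tensor of (x, y, t, p), sorted by time."""
        H, W = sensor_hw
        xs, ys = events[:, 0].long(), events[:, 1].long()
        pix = ys * W + xs                       # flat pixel index
        t = events[:, 2]
        tn = (t - t.min()) / (t.max() - t.min()).clamp(min=1e-9)
        feats = torch.stack([tn, events[:, 3]], dim=1)
        surface = torch.zeros(H * W, device=events.device)
        for q in pix.unique():                  # per-pixel loop, for clarity only
            seq = feats[pix == q].unsqueeze(0)  # (1, seq_len, feat_dim)
            _, (h, _) = self.lstm(seq)
            surface[q] = h[-1, 0, 0]            # final hidden state -> surface cell
        return surface.view(H, W)
```

Sharing one set of LSTM weights across all pixel cells is what keeps the surface compact and end-to-end differentiable with respect to the downstream task loss.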
arXiv Detail & Related papers (2020-01-10T14:09:40Z)