Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
- URL: http://arxiv.org/abs/2303.03856v2
- Date: Thu, 18 May 2023 07:48:25 GMT
- Title: Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
- Authors: Bochen Xie and Yongjian Deng and Zhanpeng Shao and Hai Liu and Qingsong Xu and Youfu Li
- Abstract summary: Event cameras are neuromorphic vision sensors representing visual information as sparse and asynchronous event streams.
We develop a novel attention-aware model named Event Voxel Set Transformer (EVSTr) for representation learning on event streams.
We evaluate the proposed model on two event-based recognition tasks: object classification and action recognition.
- Score: 23.872611710730865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras are neuromorphic vision sensors representing visual information
as sparse and asynchronous event streams. Most state-of-the-art event-based
methods project events into dense frames and process them with conventional
learning models. However, these approaches sacrifice the sparsity and high
temporal resolution of event data, resulting in a large model size and high
computational complexity. To fit the sparse nature of events and sufficiently
explore the relationship between them, we develop a novel attention-aware model
named Event Voxel Set Transformer (EVSTr) for spatiotemporal representation
learning on event streams. It first converts the event stream into voxel sets
and then hierarchically aggregates voxel features to obtain robust
representations. The core of EVSTr is an event voxel transformer encoder to
extract discriminative spatiotemporal features, which consists of two
well-designed components, including a Multi-Scale Neighbor Embedding Layer
(MNEL) for local information aggregation and a Voxel Self-Attention Layer
(VSAL) for global feature interactions. To enable the network to incorporate
long-range temporal structure, we introduce a segment modeling strategy that
learns motion patterns from a sequence of segmented voxel sets. We evaluate the
proposed model on two event-based recognition tasks: object classification and
action recognition. Comprehensive experiments show that EVSTr achieves
state-of-the-art performance while maintaining low model complexity.
Additionally, we present a new dataset (NeuroHAR), recorded in challenging
visual scenarios, to address the lack of real-world event-based datasets for
action recognition.
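The pipeline's first stage lends itself to a short illustration. Below is a minimal sketch, not the authors' code, of converting an event stream into a sparse voxel set and splitting the stream into segments for segment modeling; the grid size, the signed-count voxel feature, and the (x, y, t, p) event layout are illustrative assumptions, whereas the paper learns richer voxel features with MNEL and VSAL.

```python
import numpy as np

def events_to_voxel_set(events, sensor_hw, grid=(8, 8, 4)):
    """Quantize events into a sparse set of non-empty voxels.

    events: (N, 4) array of (x, y, timestamp, polarity in {-1, +1});
    assumes at least one event. Returns voxel coordinates (M, 3) and one
    feature per voxel; a signed event count stands in for learned features.
    """
    H, W = sensor_hw
    gx, gy, gt = grid
    x, y, t, p = events.T
    ix = np.clip((x / W * gx).astype(int), 0, gx - 1)
    iy = np.clip((y / H * gy).astype(int), 0, gy - 1)
    tn = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # normalize time to [0, 1]
    it = np.clip((tn * gt).astype(int), 0, gt - 1)
    # Keep only occupied voxels, preserving the sparsity of the stream.
    coords, inv = np.unique(np.stack([ix, iy, it], axis=1),
                            axis=0, return_inverse=True)
    feats = np.zeros(len(coords))
    np.add.at(feats, inv, p)            # signed event count per voxel
    return coords, feats

def segment_stream(events, num_segments=3):
    """Split a stream into equal-duration segments for segment modeling."""
    t = events[:, 2]
    edges = np.linspace(t.min(), t.max(), num_segments + 1)
    idx = np.clip(np.searchsorted(edges, t, side="right") - 1,
                  0, num_segments - 1)
    return [events[idx == s] for s in range(num_segments)]

# Usage: one voxel set per segment, which the encoder would then aggregate.
# voxel_sets = [events_to_voxel_set(seg, (180, 240)) for seg in segment_stream(ev)]
```

Keeping only the non-empty voxels is what distinguishes this set representation from dense frame projection: memory and compute scale with the number of occupied cells rather than the full grid.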
Related papers
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, exceeding the second-best result by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras, or dynamic vision sensors, are capable of capturing per-pixel brightness changes (called event streams) with high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions that take event streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z)
- Implicit Event-RGBD Neural SLAM [54.74363487009845]
Implicit neural SLAM has achieved remarkable progress recently.
Existing methods face significant challenges in non-ideal scenarios.
We propose EN-SLAM, the first event-RGBD implicit neural SLAM framework.
arXiv Detail & Related papers (2023-11-18T08:48:58Z)
- GET: Group Event Transformer for Event-Based Vision [82.312736707534]
Event cameras are a novel type of neuromorphic sensor that has been gaining increasing attention.
We propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET).
GET decouples temporal-polarity information from spatial information throughout the feature extraction process.
arXiv Detail & Related papers (2023-10-04T08:02:33Z)
- Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification [6.550582412924754]
This paper proposes a novel dual-stream framework for event representation, extraction, and fusion.
Experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets.
arXiv Detail & Related papers (2023-08-23T06:07:56Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into x-y-t grids for both positive and negative polarities, producing a set of pillars as a 3D tensor representation (see the sketch below).
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatiotemporal correlation between event pillars.
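As a hedged illustration of that pillar representation (assumptions, not the paper's code): events are binned into per-polarity x-y-t grids, giving a dense tensor whose (polarity, y, x) columns are the pillars. The number of time bins and the {0, 1} polarity encoding are assumptions here.

```python
import numpy as np

def events_to_pillars(events, sensor_hw, t_bins=5):
    """events: (N, 4) array of (x, y, timestamp, polarity in {0, 1})."""
    H, W = sensor_hw
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3].astype(int)        # 0 = negative, 1 = positive polarity
    tn = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    it = np.clip((tn * t_bins).astype(int), 0, t_bins - 1)
    vol = np.zeros((2, t_bins, H, W), dtype=np.float32)
    np.add.at(vol, (p, it, y, x), 1.0)  # event count per (polarity, t, y, x) cell
    return vol                          # each (p, :, y, x) column is one pillar
```

A convLSTM would then carry long memory across such tensors from successive time windows, matching the long/short memory split described above.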
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatiotemporal kernels that adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we mine the interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- EV-VGCNN: A Voxel Graph CNN for Event-based Object Classification [18.154951807178943]
Event cameras report sparse intensity changes and hold notable advantages of low power consumption, high dynamic range, and high response speed for visual perception and understanding on portable devices.
Event-based learning methods have recently achieved remarkable success in object recognition by integrating events into dense frame-based representations to apply traditional 2D learning algorithms.
However, these approaches introduce much redundant information during the sparse-to-dense conversion and necessitate heavyweight, large-capacity models, limiting the potential of event cameras in real-life applications.
arXiv Detail & Related papers (2021-06-01T04:07:03Z)
- Superevents: Towards Native Semantic Segmentation for Event-based Cameras [13.099264910430986]
Most successful computer vision models transform low-level features, such as Gabor filter responses, into richer representations of intermediate or mid-level complexity for downstream visual tasks.
We present a novel method that employs lifetime augmentation for obtaining an event stream representation that is fed to a fully convolutional network to extract superevents.
arXiv Detail & Related papers (2021-05-13T05:49:41Z)
- A Differentiable Recurrent Surface for Asynchronous Event-Based Data [19.605628378366667]
We propose Matrix-LSTM, a grid of Long Short-Term Memory (LSTM) cells that efficiently process events and learn end-to-end task-dependent event-surfaces.
Compared to existing reconstruction approaches, our learned event surface shows good flexibility on optical flow estimation.
It improves the state-of-the-art of event-based object classification on the N-Cars dataset.
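As a rough illustration of the idea (assumptions throughout, not the released Matrix-LSTM), the sketch below runs one weight-shared LSTM over each pixel's event sequence and writes the final hidden state into that pixel's cell of the surface; the per-event features (normalized timestamp and polarity) are assumed, and the per-pixel loop is kept for clarity where the real model vectorizes.

```python
import torch
import torch.nn as nn

class PerPixelLSTMSurface(nn.Module):
    """One weight-shared LSTM per pixel; its last hidden state is the surface value."""

    def __init__(self, feat_dim=2, hidden=1):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, events, sensor_hw):
        """events: (N, 4) float tensor of (x, y, t, p), sorted by time."""
        H, W = sensor_hw
        xs, ys = events[:, 0].long(), events[:, 1].long()
        pix = ys * W + xs                       # flat pixel index
        t = events[:, 2]
        tn = (t - t.min()) / (t.max() - t.min()).clamp(min=1e-9)
        feats = torch.stack([tn, events[:, 3]], dim=1)
        surface = torch.zeros(H * W, device=events.device)
        for q in pix.unique():                  # per-pixel loop, for clarity only
            seq = feats[pix == q].unsqueeze(0)  # (1, seq_len, feat_dim)
            _, (h, _) = self.lstm(seq)
            surface[q] = h[-1, 0, 0]            # final hidden state -> surface cell
        return surface.view(H, W)
```

Sharing one set of LSTM weights across all pixel cells is what keeps the surface compact and end-to-end differentiable with respect to the downstream task loss.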
arXiv Detail & Related papers (2020-01-10T14:09:40Z)