GET: Group Event Transformer for Event-Based Vision
- URL: http://arxiv.org/abs/2310.02642v1
- Date: Wed, 4 Oct 2023 08:02:33 GMT
- Title: GET: Group Event Transformer for Event-Based Vision
- Authors: Yansong Peng, Yueyi Zhang, Zhiwei Xiong, Xiaoyan Sun, and Feng Wu
- Abstract summary: Event cameras are a novel type of neuromorphic sensor that has been gaining increasing attention.
We propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET).
GET decouples temporal-polarity information from spatial information throughout the feature extraction process.
- Score: 82.312736707534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Event cameras are a novel type of neuromorphic sensor that has been gaining increasing attention. Existing event-based backbones mainly rely on image-based designs to extract spatial information from the image transformed from events, overlooking important event properties such as time and polarity. To address this issue, we propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET), which decouples temporal-polarity information from spatial information throughout the feature extraction process. Specifically, we first propose a new event representation for GET, named Group Token, which groups asynchronous events based on their timestamps and polarities. Then, GET applies the Event Dual Self-Attention block and the Group Token Aggregation module to facilitate effective feature communication and integration in both the spatial and temporal-polarity domains. After that, GET can be integrated with different downstream tasks by connecting it with various heads. We evaluate our method on four event-based classification datasets (CIFAR10-DVS, N-MNIST, N-CARS, and DVS128Gesture) and two event-based object detection datasets (1Mpx and Gen1), and the results demonstrate that GET outperforms other state-of-the-art methods. The code is available at https://github.com/Peterande/GET-Group-Event-Transformer.
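To make the Group Token idea concrete, here is a minimal, hedged sketch: it assumes a simple scheme in which each event is assigned a group index from its time bin and polarity, and each group is reduced to a patch-wise event-count grid. This is only an illustration of the grouping described in the abstract, not the authors' implementation; the function name `events_to_group_tokens` and the parameters `num_time_bins` and `patch_size` are hypothetical.

```python
import numpy as np

def events_to_group_tokens(events, height, width,
                           num_time_bins=4, patch_size=8):
    """Toy Group Token construction (illustrative only).

    events: float array of shape (N, 4) with columns (x, y, t, p),
            where p is 0 or 1. Returns an array of shape
            (num_time_bins * 2, H / patch, W / patch): one token grid
            per (time-bin, polarity) group, holding event counts.
    """
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]

    # Normalize timestamps into [0, num_time_bins) and form a group index
    # from the time bin and the polarity, mirroring the abstract's idea of
    # grouping asynchronous events by timestamp and polarity.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    time_bin = np.clip((t_norm * num_time_bins).astype(int), 0, num_time_bins - 1)
    group = time_bin * 2 + p.astype(int)          # group id in [0, 2 * num_time_bins)

    # Patch coordinates of each event (spatial tokenization).
    gh, gw = height // patch_size, width // patch_size
    px = np.clip(x.astype(int) // patch_size, 0, gw - 1)
    py = np.clip(y.astype(int) // patch_size, 0, gh - 1)

    # Accumulate per-group, per-patch event counts.
    tokens = np.zeros((num_time_bins * 2, gh, gw), dtype=np.float32)
    np.add.at(tokens, (group, py, px), 1.0)
    return tokens

# Example: 10k random events on a 128x128 sensor.
rng = np.random.default_rng(0)
ev = np.column_stack([rng.integers(0, 128, 10_000),   # x
                      rng.integers(0, 128, 10_000),   # y
                      np.sort(rng.random(10_000)),    # t
                      rng.integers(0, 2, 10_000)])    # polarity
print(events_to_group_tokens(ev, 128, 128).shape)     # (8, 16, 16)
```

In the actual GET backbone these per-group features are further processed by the Event Dual Self-Attention block and merged by the Group Token Aggregation module; the count-based tokens above merely stand in for that learned embedding.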
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate the promise of event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking [32.86991031493605]
Event-based bionic cameras capture dynamic scenes with high temporal resolution and high dynamic range.
We propose a dynamic event subframe splitting strategy to split the event stream into finer-grained event clusters (a toy illustration of such a split follows this entry).
Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions.
arXiv Detail & Related papers (2024-09-26T06:12:08Z)
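The subframe splitting above can be approximated with a tiny sketch: sort events by timestamp and cut the stream into clusters of roughly equal event count. This is an assumption for illustration, not the paper's actual strategy (which adapts the split dynamically); `split_into_subframes` and `num_subframes` are made-up names.

```python
import numpy as np

def split_into_subframes(events, num_subframes=8):
    """Split an event stream (N, 4) with columns (x, y, t, p) into
    roughly equal-sized, time-ordered clusters (illustrative only)."""
    order = np.argsort(events[:, 2])          # sort by timestamp
    return np.array_split(events[order], num_subframes)

# Example: 1000 random events -> 8 fine-grained clusters.
rng = np.random.default_rng(1)
ev = np.column_stack([rng.integers(0, 346, 1000),   # x
                      rng.integers(0, 260, 1000),   # y
                      rng.random(1000),             # t
                      rng.integers(0, 2, 1000)])    # polarity
subframes = split_into_subframes(ev)
print([len(s) for s in subframes])            # eight chunks of ~125 events
```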
- MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking [50.26836546224782]
Event-based eye tracking has shown great promise thanks to its high temporal resolution and low redundancy.
The diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization.
This paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information.
arXiv Detail & Related papers (2024-04-18T11:09:25Z)
- Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification [6.550582412924754]
This paper proposes a novel dual-stream framework for event representation, extraction, and fusion.
Experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets.
arXiv Detail & Related papers (2023-08-23T06:07:56Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as a 3D tensor representation (a minimal voxelization sketch follows this entry).
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
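The pillar representation described above (an x-y-t grid per polarity) can be pictured with a short, hedged sketch that scatter-adds events into a dense count tensor of shape (2, T, H, W). The bin sizes and the plain counting are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def events_to_pillars(events, height, width, num_time_bins=5):
    """Voxelize events (N, 4) with columns (x, y, t, p) into a dense
    (2, T, H, W) count tensor: one x-y-t grid per polarity
    (illustrative sketch only)."""
    x = np.clip(events[:, 0].astype(int), 0, width - 1)
    y = np.clip(events[:, 1].astype(int), 0, height - 1)
    t = events[:, 2]
    p = events[:, 3].astype(int)

    # Assign each event to a time bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    tb = np.clip((t_norm * num_time_bins).astype(int), 0, num_time_bins - 1)

    # Scatter-add per-voxel event counts, one grid per polarity.
    grid = np.zeros((2, num_time_bins, height, width), dtype=np.float32)
    np.add.at(grid, (p, tb, y, x), 1.0)
    return grid

# Example: a (2, 5, 260, 346) tensor from 10k random events.
rng = np.random.default_rng(2)
ev = np.column_stack([rng.integers(0, 346, 10_000),
                      rng.integers(0, 260, 10_000),
                      rng.random(10_000),
                      rng.integers(0, 2, 10_000)])
print(events_to_pillars(ev, 260, 346).shape)
```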
- Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams [19.957857885844838]
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams.
We propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient representation learning on event streams.
Experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
arXiv Detail & Related papers (2023-03-07T12:48:02Z)
- Event Transformer [43.193463048148374]
The event camera's low power consumption and ability to capture brightness changes at microsecond resolution make it attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs).
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token (a rough per-event tokenization sketch follows this entry).
arXiv Detail & Related papers (2022-04-11T15:05:06Z)
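The event-token idea above can be sketched briefly: each raw event (x, y, t, p) is normalized and projected into an embedding vector, so the stream becomes a token sequence rather than a frame. This is a rough guess at the general shape of such a representation, not the paper's architecture; the random projection stands in for a learned embedding layer, and all names and dimensions are arbitrary.

```python
import numpy as np

def events_to_tokens(events, height, width, embed_dim=32, seed=0):
    """Embed each event (x, y, t, p) as one token (illustrative only).

    events: (N, 4) array -> returns an (N, embed_dim) token sequence.
    The random projection stands in for a learned embedding layer."""
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]

    # Normalize every attribute to a comparable range.
    feats = np.stack([x / max(width - 1, 1),
                      y / max(height - 1, 1),
                      (t - t.min()) / max(t.max() - t.min(), 1e-9),
                      2.0 * p - 1.0], axis=1)        # polarity -> {-1, +1}

    rng = np.random.default_rng(seed)
    proj = rng.normal(scale=0.1, size=(4, embed_dim))  # stand-in "weights"
    return feats @ proj                                # (N, embed_dim) tokens

# Example: 500 events become 500 tokens of dimension 32.
rng = np.random.default_rng(3)
ev = np.column_stack([rng.integers(0, 128, 500),
                      rng.integers(0, 128, 500),
                      rng.random(500),
                      rng.integers(0, 2, 500)])
print(events_to_tokens(ev, 128, 128).shape)            # (500, 32)
```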
- Bridging the Gap between Events and Frames through Unsupervised Domain Adaptation [57.22705137545853]
We propose a task transfer method that allows models to be trained directly with labeled images and unlabeled event data.
We leverage the generative event model to split event features into content and motion features.
Our approach unlocks the vast amount of existing image datasets for the training of event-based neural networks.
arXiv Detail & Related papers (2021-09-06T17:31:37Z)
- AET-EFN: A Versatile Design for Static and Dynamic Event-Based Vision [33.4444564715323]
Event data are noisy, sparse, and nonuniform in the spatial-temporal domain with an extremely high temporal resolution.
Existing methods encode events into point-cloud-based or voxel-based representations, but suffer from noise and/or information loss.
This work proposes the Aligned Event Tensor (AET) as a novel event data representation, together with a neat framework called Event Frame Net (EFN).
The proposed AET and EFN are evaluated on various datasets and shown to surpass existing state-of-the-art methods by large margins.
arXiv Detail & Related papers (2021-03-22T08:09:03Z)
- Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)