Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion
based Classification
- URL: http://arxiv.org/abs/2308.11937v1
- Date: Wed, 23 Aug 2023 06:07:56 GMT
- Title: Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion
based Classification
- Authors: Chengguo Yuan, Yu Jin, Zongzhen Wu, Fanting Wei, Yangzirui Wang, Lan
Chen, and Xiao Wang
- Abstract summary: This paper proposes a novel dual-stream framework for event representation, extraction, and fusion.
Experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets.
- Score: 6.550582412924754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing target objects using an event-based camera draws more and more
attention in recent years. Existing works usually represent the event streams
into point-cloud, voxel, image, etc, and learn the feature representations
using various deep neural networks. Their final results may be limited by the
following factors: monotonous modal expressions and the design of the network
structure. To address the aforementioned challenges, this paper proposes a
novel dual-stream framework for event representation, extraction, and fusion.
This framework simultaneously models two common representations: event images
and event voxels. By utilizing Transformer and Structured Graph Neural Network
(GNN) architectures, spatial information and three-dimensional stereo
information can be learned separately. Additionally, a bottleneck Transformer
is introduced to facilitate the fusion of the dual-stream information.
Extensive experiments demonstrate that our proposed framework achieves
state-of-the-art performance on two widely used event-based classification
datasets. The source code of this work is available at:
\url{https://github.com/Event-AHU/EFV_event_classification}
Related papers
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51%$, which exceeds the second place by $+2.21%$.
arXiv Detail & Related papers (2024-06-27T02:32:46Z) - GET: Group Event Transformer for Event-Based Vision [82.312736707534]
Event cameras are a type of novel neuromorphic sen-sor that has been gaining increasing attention.
We propose a novel Group-based vision Transformer backbone for Event-based vision, called Group Event Transformer (GET)
GET de-couples temporal-polarity information from spatial infor-mation throughout the feature extraction process.
arXiv Detail & Related papers (2023-10-04T08:02:33Z) - Event Stream-based Visual Object Tracking: A High-Resolution Benchmark
Dataset and A Novel Baseline [38.42400442371156]
Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker.
We propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer.
We propose the first large-scale high-resolution ($1280 times 720$) dataset named EventVOT. It contains 1141 videos and covers a wide range of categories such as pedestrians, vehicles, UAVs, ping pongs, etc.
arXiv Detail & Related papers (2023-09-26T01:42:26Z) - SSTFormer: Bridging Spiking Neural Network and Memory Support
Transformer for Frame-Event based Recognition [42.118434116034194]
We propose to recognize patterns by fusing RGB frames and event streams simultaneously.
Due to the scarce of RGB-Event based classification dataset, we also propose a large-scale PokerEvent dataset.
arXiv Detail & Related papers (2023-08-08T16:15:35Z) - Dual Memory Aggregation Network for Event-Based Object Detection with
Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z) - Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z) - Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams [19.957857885844838]
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams.
We propose an attentionaware model named Event Voxel Set Transformer (EVSTr) for efficient representation learning on event streams.
Experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
arXiv Detail & Related papers (2023-03-07T12:48:02Z) - Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image
Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
arXiv Detail & Related papers (2022-10-19T11:50:14Z) - Cross-receptive Focused Inference Network for Lightweight Image
Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
Transformers that need to incorporate contextual information to extract features dynamically are neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z) - Event Transformer [43.193463048148374]
Event camera's low power consumption and ability to capture microsecond brightness make it attractive for various computer vision tasks.
Existing event representation methods typically convert events into frames, voxel grids, or spikes for deep neural networks (DNNs)
This work introduces a novel token-based event representation, where each event is considered a fundamental processing unit termed an event-token.
arXiv Detail & Related papers (2022-04-11T15:05:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.