Recurrent Vision Transformers for Object Detection with Event Cameras
- URL: http://arxiv.org/abs/2212.05598v3
- Date: Thu, 25 May 2023 09:17:11 GMT
- Title: Recurrent Vision Transformers for Object Detection with Event Cameras
- Authors: Mathias Gehrig and Davide Scaramuzza
- Abstract summary: We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras.
RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection.
Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
- Score: 62.27246562304705
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Recurrent Vision Transformers (RVTs), a novel backbone for object
detection with event cameras. Event cameras provide visual information with
sub-millisecond latency, high dynamic range, and strong robustness
against motion blur. These unique properties offer great potential for
low-latency object detection and tracking in time-critical scenarios. Prior
work in event-based vision has achieved outstanding detection performance but
at the cost of substantial inference time, typically beyond 40 milliseconds. By
revisiting the high-level design of recurrent vision backbones, we reduce
inference time by a factor of 6 while retaining similar performance. To achieve
this, we explore a multi-stage design that utilizes three key concepts in each
stage: First, a convolutional prior that can be regarded as a conditional
positional embedding. Second, local and dilated global self-attention for
spatial feature interaction. Third, recurrent temporal feature aggregation to
minimize latency while retaining temporal information. RVTs can be trained from
scratch to reach state-of-the-art performance on event-based object detection -
achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time,
RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter
efficiency (5 times fewer than prior art). Our study brings new insights into
effective design choices that can be fruitful for research beyond event-based
vision.
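The three-concept stage design enumerated in the abstract lends itself to a compact sketch. The following is a minimal, illustrative PyTorch rendition of one RVT-style stage, not the authors' implementation: a strided convolution supplies the convolutional prior (acting as a conditional positional embedding), block-local and dilated grid self-attention provide spatial feature interaction, and a per-pixel LSTM cell performs the recurrent temporal aggregation. All channel counts, window sizes, and hyperparameters are assumptions chosen for illustration.

```python
# Minimal sketch of one RVT-style stage (illustrative, not the authors'
# code). Channel counts, window size, and head count are assumptions.
import torch
import torch.nn as nn


def window_partition(x, ws, dilated=False):
    """Split (B, H, W, C) into (B * num_windows, ws * ws, C) windows.

    dilated=False gathers contiguous ws x ws blocks (local attention);
    dilated=True gathers a strided grid so that each window spans the
    whole feature map (dilated global attention).
    """
    B, H, W, C = x.shape
    if dilated:
        x = x.view(B, ws, H // ws, ws, W // ws, C).permute(0, 2, 4, 1, 3, 5)
    else:
        x = x.view(B, H // ws, ws, W // ws, ws, C).permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, ws * ws, C)


def window_unpartition(w, ws, shape, dilated=False):
    """Inverse of window_partition, back to (B, H, W, C)."""
    B, H, W, C = shape
    w = w.view(B, H // ws, W // ws, ws, ws, C)
    w = w.permute(0, 3, 1, 4, 2, 5) if dilated else w.permute(0, 1, 3, 2, 4, 5)
    return w.reshape(B, H, W, C)


class RVTStyleStage(nn.Module):
    def __init__(self, in_ch, dim, heads=4, ws=4):
        super().__init__()
        self.ws = ws
        # 1) Convolutional prior: a strided conv that downsamples and acts
        #    as a conditional positional embedding.
        self.conv = nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=1)
        # 2) Local and dilated global self-attention for spatial interaction.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.grid_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # 3) Recurrent temporal aggregation: an LSTM cell applied per pixel.
        self.lstm = nn.LSTMCell(dim, dim)

    def forward(self, x, state=None):
        x = self.conv(x)                         # (B, C, H, W)
        B, C, H, W = x.shape
        x = x.permute(0, 2, 3, 1)                # (B, H, W, C)
        for attn, norm, dil in ((self.local_attn, self.norm1, False),
                                (self.grid_attn, self.norm2, True)):
            w = window_partition(x, self.ws, dilated=dil)
            n = norm(w)
            w = w + attn(n, n, n)[0]             # residual attention block
            x = window_unpartition(w, self.ws, (B, H, W, C), dilated=dil)
        h, c = self.lstm(x.reshape(B * H * W, C), state)
        return h.view(B, H, W, C).permute(0, 3, 1, 2), (h, c)


stage = RVTStyleStage(in_ch=20, dim=64)          # e.g. a 20-bin event tensor
state = None
for _ in range(4):                               # one LSTM step per time slice
    feats, state = stage(torch.randn(2, 20, 32, 32), state)
print(feats.shape)                               # torch.Size([2, 64, 16, 16])
```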
Related papers
- Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences [25.74000325019015]
We introduce a novel LiDAR 3D object detection framework, namely LiSTM, to facilitate spatial-temporal feature learning with cross-frame motion forecasting information.
We have conducted experiments on the Waymo and nuScenes datasets to demonstrate that the proposed framework achieves superior 3D detection performance.
arXiv Detail & Related papers (2024-09-06T16:29:04Z)
- MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking [50.26836546224782]
Event-based eye tracking has shown great promise thanks to its high temporal resolution and low redundancy.
The diversity and abruptness of eye movement patterns, including blinks, fixations, saccades, and smooth pursuit, pose significant challenges for eye localization.
This paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information.
arXiv Detail & Related papers (2024-04-18T11:09:25Z)
- SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features [52.213656737672935]
SpikeMOT is an event-based multi-object tracker that uses spiking neural networks to extract sparse spatiotemporal features from event streams associated with objects.
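The summary above does not detail SpikeMOT's network, but the basic mechanism by which spiking neurons turn event streams into sparse spatiotemporal features can be shown with a minimal leaky integrate-and-fire (LIF) layer; the decay and threshold values below are illustrative assumptions, not the paper's settings.

```python
# Minimal leaky integrate-and-fire (LIF) layer: illustrates how spiking
# networks turn dense input currents into sparse binary feature maps.
# decay and threshold are illustrative assumptions.
import torch


def lif_forward(inputs, decay=0.9, threshold=1.0):
    """Run a LIF neuron layer over a (T, N) sequence of input currents.

    Each step, the membrane potential leaks by `decay`, integrates the
    input, and emits a binary spike (then resets) on crossing `threshold`.
    The resulting spike train is sparse in both time and neuron index.
    """
    mem = torch.zeros_like(inputs[0])
    spikes = []
    for x_t in inputs:                       # iterate over time steps
        mem = decay * mem + x_t              # leaky integration
        spike = (mem >= threshold).float()   # fire on threshold crossing
        mem = mem * (1.0 - spike)            # hard reset of firing neurons
        spikes.append(spike)
    return torch.stack(spikes)               # (T, N) sparse binary features


currents = 0.3 * torch.rand(100, 8)          # toy input: 100 steps, 8 neurons
print(lif_forward(currents).mean())          # fraction of (step, neuron) spikes
```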
arXiv Detail & Related papers (2023-09-29T05:13:43Z)
- OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose OCBEV, an Object-Centric query-based BEV detector that captures the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture per-pixel brightness changes in an asynchronous manner.
Event streams are divided into grids along the x-y-t coordinates, separately for positive and negative polarity, producing a set of pillars as a 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
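As a concrete illustration of the pillar representation described above, the short NumPy sketch below bins a toy event stream into an x-y-t grid kept separate per polarity. The sensor resolution (matching the Gen1 sensor mentioned earlier) and the number of temporal bins are assumptions for illustration, not the paper's settings.

```python
# Minimal sketch of an event-pillar representation: raw events
# (x, y, t, polarity) are histogrammed into an x-y-t grid with one
# channel per polarity. Grid sizes are assumptions, not the paper's.
import numpy as np


def events_to_pillars(events, sensor_hw=(240, 304), t_bins=10):
    """Bin events into a (2, t_bins, H, W) tensor: channel 0 holds
    negative-polarity counts, channel 1 positive-polarity counts."""
    H, W = sensor_hw
    grid = np.zeros((2, t_bins, H, W), dtype=np.float32)
    t = events["t"]
    span = max(float(np.ptp(t)), 1e-9)       # duration of this time slice
    t_idx = ((t - t.min()) / span * t_bins).astype(int)
    t_idx = np.clip(t_idx, 0, t_bins - 1)    # map timestamps to bin indices
    p_idx = (events["p"] > 0).astype(int)    # 0 = negative, 1 = positive
    np.add.at(grid, (p_idx, t_idx, events["y"], events["x"]), 1.0)
    return grid


# Toy event stream: 1000 random events over a 50 ms slice.
rng = np.random.default_rng(0)
events = {
    "x": rng.integers(0, 304, 1000),
    "y": rng.integers(0, 240, 1000),
    "t": rng.uniform(0.0, 50e-3, 1000),      # timestamps in seconds
    "p": rng.choice([-1, 1], 1000),
}
pillars = events_to_pillars(events)
print(pillars.shape, pillars.sum())          # (2, 10, 240, 304) 1000.0
```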
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention [50.11672196146829]
3D object detection with surround-view images is an essential task for autonomous driving.
We propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images.
arXiv Detail & Related papers (2022-12-15T14:18:47Z)
- Event-based YOLO Object Detection: Proof of Concept for Forward Perception System [0.3058685580689604]
This study focuses on leveraging neuromorphic event data for roadside object detection.
In this article, the event-simulated A2D2 dataset is manually annotated and used to train two different YOLOv5 networks.
arXiv Detail & Related papers (2022-12-14T12:12:29Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)