SODFormer: Streaming Object Detection with Transformer Using Events and Frames
- URL: http://arxiv.org/abs/2308.04047v1
- Date: Tue, 8 Aug 2023 04:53:52 GMT
- Title: SODFormer: Streaming Object Detection with Transformer Using Events and Frames
- Authors: Dianze Li and Jianing Li and Yonghong Tian
- Abstract summary: The DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector with Transformer, which first integrates events and frames to continuously detect objects in an asynchronous manner.
- Score: 31.293847706713052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The DAVIS camera, streaming two complementary sensing modalities of asynchronous
events and frames, has gradually been used to address major object detection
challenges (e.g., fast motion blur and low-light). However, how to effectively
leverage rich temporal cues and fuse two heterogeneous visual streams remains a
challenging endeavor. To address this challenge, we propose a novel streaming
object detector with Transformer, namely SODFormer, which first integrates
events and frames to continuously detect objects in an asynchronous manner.
Technically, we first build a large-scale multimodal neuromorphic object
detection dataset (i.e., PKU-DAVIS-SOD) with 1080.1k manual labels. Then, we
design a spatiotemporal Transformer architecture to detect objects via an
end-to-end sequence prediction problem, where the novel temporal Transformer
module leverages rich temporal cues from two visual streams to improve the
detection performance. Finally, an asynchronous attention-based fusion module
is proposed to integrate the two heterogeneous sensing modalities and take
complementary advantage of each; it can be queried at any time to locate
objects, breaking through the limited output frequency of synchronized
frame-based fusion strategies. The results show that the proposed SODFormer
outperforms four state-of-the-art methods and our eight baselines by a
significant margin. We also show that our unifying framework works well even in
cases where the conventional frame-based camera fails, e.g., high-speed motion
and low-light conditions. Our dataset and code are available at
https://github.com/dianzl/SODFormer.
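As a rough illustration of the asynchronous attention-based fusion described in the abstract, the sketch below fuses event-stream tokens with tokens from the most recent frame via cross-attention; the shapes, dimensions, and single-layer design are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of attention-based event-frame fusion (assumed design,
# not SODFormer's code): event tokens act as queries, so fused features can
# be produced whenever events arrive, independent of the frame rate.
import torch
import torch.nn as nn

class EventFrameFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_tokens, frame_tokens):
        # event_tokens: (B, N_e, C) from the event stream (high temporal rate)
        # frame_tokens: (B, N_f, C) from the latest frame (low temporal rate)
        fused, _ = self.attn(query=event_tokens,
                             key=frame_tokens,
                             value=frame_tokens)
        return self.norm(event_tokens + fused)  # residual fusion

fusion = EventFrameFusion()
events = torch.randn(1, 900, 256)   # e.g. 30x30 event feature tokens
frames = torch.randn(1, 900, 256)   # tokens from the last available frame
print(fusion(events, frames).shape)  # torch.Size([1, 900, 256])
```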
Related papers
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, offering a potential solution to the limitations of conventional frame-based cameras.
We propose a novel hierarchical feature refinement network for event-frame fusion.
Our method exhibits significantly better robustness when 15 different corruption types are introduced to the frame images.
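To make that evaluation protocol concrete, here is a hypothetical sketch that corrupts only the frame modality while leaving the event stream untouched; the two corruptions shown are stand-ins for the 15 types used in the paper.

```python
# Hypothetical robustness check: degrade frames, keep events clean, then
# evaluate the fusion detector on the degraded input. The corruption set
# here is an assumption, not the paper's benchmark code.
import numpy as np

def gaussian_noise(frame, sigma=25.0):
    noisy = frame.astype(np.float32) + np.random.normal(0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def box_blur(frame, k=5):
    # Separable mean filter applied along height, then width.
    out = frame.astype(np.float32)
    kernel = np.ones(k) / k
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda m: np.convolve(m, kernel, mode="same"), axis, out)
    return out.astype(np.uint8)

frame = np.random.randint(0, 256, (260, 346, 3), dtype=np.uint8)  # DAVIS346 size
for corrupt in (gaussian_noise, box_blur):
    corrupted = corrupt(frame)
    # detector(corrupted, clean_events) would be scored here
    print(corrupt.__name__, corrupted.shape, corrupted.dtype)
```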
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, distinguishing the positive query from other highly similar queries that are not the best match poses a challenge for the network.
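A minimal, generic sketch of graph attention over DETR-style object queries follows; the fully connected edges between current- and previous-frame queries are an assumption, and STEMD's actual graph construction and edge features are not reproduced.

```python
# Generic graph attention over object queries (assumed structure): queries
# are graph nodes, attention weights act as learned soft edges.
import torch
import torch.nn as nn

class QueryGraphAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, curr_queries, prev_queries):
        # curr_queries: (B, N, C) current-frame queries (nodes to update)
        # prev_queries: (B, N, C) previous-frame queries (temporal neighbors)
        nodes = torch.cat([curr_queries, prev_queries], dim=1)  # (B, 2N, C)
        attn = torch.softmax(
            self.q(curr_queries) @ self.k(nodes).transpose(1, 2)
            / curr_queries.shape[-1] ** 0.5, dim=-1)            # soft edges
        return curr_queries + attn @ self.v(nodes)              # message passing

gat = QueryGraphAttention()
curr, prev = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
print(gat(curr, prev).shape)  # torch.Size([2, 100, 256])
```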
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
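The pillar representation described above can be sketched as a simple binning step; the sensor resolution, number of temporal bins, and window length below are illustrative assumptions.

```python
# A minimal sketch of the described representation: bin events into an
# (x, y, t) grid separately for each polarity, giving a 3D tensor of pillars.
import numpy as np

def events_to_pillars(events, H=260, W=346, T=5, duration=50_000):
    """events: (N, 4) array of [x, y, t_us, polarity in {0, 1}].

    Returns a (2, T, H, W) tensor of per-polarity event counts.
    """
    x, y = events[:, 0].astype(int), events[:, 1].astype(int)
    t, p = events[:, 2], events[:, 3].astype(int)
    t_bin = np.clip(((t - t.min()) / duration * T).astype(int), 0, T - 1)
    grid = np.zeros((2, T, H, W), dtype=np.float32)
    np.add.at(grid, (p, t_bin, y, x), 1.0)  # scatter-add event counts
    return grid

# 10k synthetic events over a 50 ms window on a DAVIS346-sized sensor
ev = np.column_stack([np.random.randint(0, 346, 10_000),
                      np.random.randint(0, 260, 10_000),
                      np.sort(np.random.randint(0, 50_000, 10_000)),
                      np.random.randint(0, 2, 10_000)])
print(events_to_pillars(ev).shape)  # (2, 5, 260, 346)
```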
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
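A minimal sketch of a space-time-memory read of this kind is shown below, with the current frame querying keys and values built from adjacent frames; the dimensions, shared key projection, and single-head attention are assumptions rather than the paper's exact design.

```python
# Assumed STM-style memory read: current-frame features attend over
# key/value maps computed from adjacent (memory) frames.
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    def __init__(self, feat_dim=256, key_dim=64):
        super().__init__()
        self.to_key = nn.Conv2d(feat_dim, key_dim, 1)
        self.to_value = nn.Conv2d(feat_dim, feat_dim, 1)

    def forward(self, query_feat, memory_feats):
        # query_feat: (B, C, H, W); memory_feats: (B, T, C, H, W)
        B, T, C, H, W = memory_feats.shape
        q = self.to_key(query_feat).flatten(2)                   # (B, K, HW)
        mem = memory_feats.flatten(0, 1)                         # (B*T, C, H, W)
        k = self.to_key(mem).flatten(2).view(B, T, -1, H * W)
        v = self.to_value(mem).flatten(2).view(B, T, -1, H * W)
        k = k.permute(0, 2, 1, 3).flatten(2)                     # (B, K, T*HW)
        v = v.permute(0, 2, 1, 3).flatten(2)                     # (B, C, T*HW)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)      # (B, HW, T*HW)
        read = (v @ attn.transpose(1, 2)).view(B, C, H, W)
        return torch.cat([query_feat, read], dim=1)              # fuse by concat

stm = MemoryRead()
out = stm(torch.randn(1, 256, 32, 32), torch.randn(1, 3, 256, 32, 32))
print(out.shape)  # torch.Size([1, 512, 32, 32])
```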
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs crossmodal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
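One way to read the full-duplex idea is that both streams exchange features simultaneously, from the same input state, rather than sequentially; the mutual gating below is an assumed sketch, not FSNet's implementation.

```python
# Assumed full-duplex exchange: appearance and motion features are
# transmitted to each other at the same time, before fusion.
import torch
import torch.nn as nn

class FullDuplexExchange(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.a2m = nn.Conv2d(channels, channels, 1)  # appearance -> motion
        self.m2a = nn.Conv2d(channels, channels, 1)  # motion -> appearance

    def forward(self, appearance, motion):
        # Both directions are computed from the *input* features, so
        # transmission and receiving happen simultaneously, not in sequence.
        new_app = appearance + torch.sigmoid(self.m2a(motion)) * appearance
        new_mot = motion + torch.sigmoid(self.a2m(appearance)) * motion
        return new_app, new_mot

fx = FullDuplexExchange()
a, m = torch.randn(1, 256, 44, 44), torch.randn(1, 256, 44, 44)
out_a, out_m = fx(a, m)
print(out_a.shape, out_m.shape)
```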
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
- End-to-End Video Object Detection with Spatial-Temporal Transformers [33.40462554784311]
We present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture.
Our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring.
These designs boost the strong baseline deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset.
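In the spirit of this design, the sketch below links per-frame object queries with a small temporal Transformer; the layer sizes and the choice to keep only the last frame's refined queries are assumptions, not TransVOD's code.

```python
# Assumed temporal query aggregation: object queries from several frames
# attend to one another, replacing post-processing such as Seq-NMS.
import torch
import torch.nn as nn

class TemporalQueryAggregator(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, queries_per_frame):
        # queries_per_frame: (B, T, N, C) output queries of T frames
        B, T, N, C = queries_per_frame.shape
        tokens = self.encoder(queries_per_frame.reshape(B, T * N, C))
        return tokens.view(B, T, N, C)[:, -1]  # refined current-frame queries

agg = TemporalQueryAggregator()
q = torch.randn(2, 4, 100, 256)  # 4 frames, 100 queries each
print(agg(q).shape)              # torch.Size([2, 100, 256])
```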
arXiv Detail & Related papers (2021-05-23T11:44:22Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
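One plausible reading of such an asymmetric block is a motion-derived soft attention that modulates appearance features, but not the reverse, as sketched below; this is an interpretation, not MATNet's code.

```python
# Assumed asymmetric, motion-attentive transition: motion features gate
# appearance features through a learned spatial attention map.
import torch
import torch.nn as nn

class MotionAttentiveTransition(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, appearance, motion):
        # appearance, motion: (B, C, H, W) features from the two-stream encoder
        attn = self.gate(motion)                 # motion-derived soft attention
        return appearance + appearance * attn    # residual, one-way transition

mat = MotionAttentiveTransition()
app, mot = torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64)
print(mat(app, mot).shape)  # torch.Size([1, 256, 64, 64])
```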
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
- Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking [87.0297771292994]
We propose an Event-based Tracking-by-Detection (ETD) method for generic bounding box-based object tracking.
To achieve this goal, we present an Adaptive Time-Surface with Linear Time Decay (ATSLTD) event-to-frame conversion algorithm.
We compare the proposed ETD method with seven popular object tracking methods, based on either conventional or event cameras, and with two variants of ETD.
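A minimal sketch of a time surface with linear time decay, in the spirit of the ATSLTD conversion above, is given below; the fixed window length and sensor size are illustrative, and the paper's adaptive windowing is omitted.

```python
# Assumed time surface with linear time decay: pixel intensity falls
# linearly from 1 (newest event) to 0 (events older than the window).
import numpy as np

def linear_decay_time_surface(events, H=180, W=240, window=30_000):
    """events: (N, 3) array of [x, y, t_us], assumed sorted by time."""
    surface = np.zeros((H, W), dtype=np.float32)
    t_end = events[-1, 2]
    recent = events[events[:, 2] >= t_end - window]
    x, y = recent[:, 0].astype(int), recent[:, 1].astype(int)
    decay = 1.0 - (t_end - recent[:, 2]) / window  # linear decay in [0, 1]
    surface[y, x] = decay  # newer events overwrite older ones per pixel
    return surface

ev = np.column_stack([np.random.randint(0, 240, 5_000),
                      np.random.randint(0, 180, 5_000),
                      np.sort(np.random.randint(0, 100_000, 5_000))])
ts = linear_decay_time_surface(ev)
print(ts.shape, float(ts.max()))  # (180, 240) and a value close to 1.0
```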
arXiv Detail & Related papers (2020-02-13T15:58:31Z)