SODFormer: Streaming Object Detection with Transformer Using Events and Frames
- URL: http://arxiv.org/abs/2308.04047v1
- Date: Tue, 8 Aug 2023 04:53:52 GMT
- Title: SODFormer: Streaming Object Detection with Transformer Using Events and Frames
- Authors: Dianze Li and Jianing Li and Yonghong Tian
- Abstract summary: The DAVIS camera, streaming two complementary sensing modalities of asynchronous events and frames, has gradually been used to address major object detection challenges.
We propose SODFormer, a novel streaming object detector with Transformer, which first integrates events and frames to continuously detect objects in an asynchronous manner.
- Score: 31.293847706713052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The DAVIS camera, streaming two complementary sensing modalities of asynchronous
events and frames, has gradually been used to address major object detection
challenges (e.g., fast motion blur and low-light). However, how to effectively
leverage rich temporal cues and fuse two heterogeneous visual streams remains a
challenging endeavor. To address this challenge, we propose a novel streaming
object detector with Transformer, namely SODFormer, which first integrates
events and frames to continuously detect objects in an asynchronous manner.
Technically, we first build a large-scale multimodal neuromorphic object
detection dataset (i.e., PKU-DAVIS-SOD) with 1080.1k manual labels. Then, we
design a spatiotemporal Transformer architecture to detect objects via an
end-to-end sequence prediction problem, where the novel temporal Transformer
module leverages rich temporal cues from two visual streams to improve the
detection performance. Finally, an asynchronous attention-based fusion module
is proposed to integrate the two heterogeneous sensing modalities and take
complementary advantage of each; it can be queried at any time to locate
objects, breaking through the limited output frequency of synchronized
frame-based fusion strategies. The results show that the proposed SODFormer
outperforms four state-of-the-art methods and our eight baselines by a
significant margin. We also show that our unifying framework works well even in
cases where the conventional frame-based camera fails, e.g., high-speed motion
and low-light conditions. Our dataset and code are available at
https://github.com/dianzl/SODFormer.
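As a rough illustration of the asynchronous attention-based fusion described in the abstract, the sketch below fuses event-stream tokens with tokens from the most recent frame via cross-attention; the shapes, dimensions, and single-layer design are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of attention-based event-frame fusion (assumed design,
# not SODFormer's code): event tokens act as queries, so fused features can
# be produced whenever events arrive, independent of the frame rate.
import torch
import torch.nn as nn

class EventFrameFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_tokens, frame_tokens):
        # event_tokens: (B, N_e, C) from the event stream (high temporal rate)
        # frame_tokens: (B, N_f, C) from the latest frame (low temporal rate)
        fused, _ = self.attn(query=event_tokens,
                             key=frame_tokens,
                             value=frame_tokens)
        return self.norm(event_tokens + fused)  # residual fusion

fusion = EventFrameFusion()
events = torch.randn(1, 900, 256)   # e.g. 30x30 event feature tokens
frames = torch.randn(1, 900, 256)   # tokens from the last available frame
print(fusion(events, frames).shape)  # torch.Size([1, 900, 256])
```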
Related papers
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, offering a potential solution to the limitations of conventional frame-based cameras.
We propose a novel hierarchical feature refinement network for event-frame fusion.
Our method exhibits significantly better robustness when 15 different corruption types are introduced to the frame images.
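To make that evaluation protocol concrete, here is a hypothetical sketch that corrupts only the frame modality while leaving the event stream untouched; the two corruptions shown are stand-ins for the 15 types used in the paper.

```python
# Hypothetical robustness check: degrade frames, keep events clean, then
# evaluate the fusion detector on the degraded input. The corruption set
# here is an assumption, not the paper's benchmark code.
import numpy as np

def gaussian_noise(frame, sigma=25.0):
    noisy = frame.astype(np.float32) + np.random.normal(0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def box_blur(frame, k=5):
    # Separable mean filter applied along height, then width.
    out = frame.astype(np.float32)
    kernel = np.ones(k) / k
    for axis in (0, 1):
        out = np.apply_along_axis(
            lambda m: np.convolve(m, kernel, mode="same"), axis, out)
    return out.astype(np.uint8)

frame = np.random.randint(0, 256, (260, 346, 3), dtype=np.uint8)  # DAVIS346 size
for corrupt in (gaussian_noise, box_blur):
    corrupted = corrupt(frame)
    # detector(corrupted, clean_events) would be scored here
    print(corrupt.__name__, corrupted.shape, corrupted.dtype)
```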
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, distinguishing the positive query from other highly similar queries that are not the best match poses a challenge for the network.
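A minimal, generic sketch of graph attention over DETR-style object queries follows; the fully connected edges between current- and previous-frame queries are an assumption, and STEMD's actual graph construction and edge features are not reproduced.

```python
# Generic graph attention over object queries (assumed structure): queries
# are graph nodes, attention weights act as learned soft edges.
import torch
import torch.nn as nn

class QueryGraphAttention(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, curr_queries, prev_queries):
        # curr_queries: (B, N, C) current-frame queries (nodes to update)
        # prev_queries: (B, N, C) previous-frame queries (temporal neighbors)
        nodes = torch.cat([curr_queries, prev_queries], dim=1)  # (B, 2N, C)
        attn = torch.softmax(
            self.q(curr_queries) @ self.k(nodes).transpose(1, 2)
            / curr_queries.shape[-1] ** 0.5, dim=-1)            # soft edges
        return curr_queries + attn @ self.v(nodes)              # message passing

gat = QueryGraphAttention()
curr, prev = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
print(gat(curr, prev).shape)  # torch.Size([2, 100, 256])
```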
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Dual Memory Aggregation Network for Event-Based Object Detection with Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture the brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
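The pillar representation described above can be sketched as a simple binning step; the sensor resolution, number of temporal bins, and window length below are illustrative assumptions.

```python
# A minimal sketch of the described representation: bin events into an
# (x, y, t) grid separately for each polarity, giving a 3D tensor of pillars.
import numpy as np

def events_to_pillars(events, H=260, W=346, T=5, duration=50_000):
    """events: (N, 4) array of [x, y, t_us, polarity in {0, 1}].

    Returns a (2, T, H, W) tensor of per-polarity event counts.
    """
    x, y = events[:, 0].astype(int), events[:, 1].astype(int)
    t, p = events[:, 2], events[:, 3].astype(int)
    t_bin = np.clip(((t - t.min()) / duration * T).astype(int), 0, T - 1)
    grid = np.zeros((2, T, H, W), dtype=np.float32)
    np.add.at(grid, (p, t_bin, y, x), 1.0)  # scatter-add event counts
    return grid

# 10k synthetic events over a 50 ms window on a DAVIS346-sized sensor
ev = np.column_stack([np.random.randint(0, 346, 10_000),
                      np.random.randint(0, 260, 10_000),
                      np.sort(np.random.randint(0, 50_000, 10_000)),
                      np.random.randint(0, 2, 10_000)])
print(events_to_pillars(ev).shape)  # (2, 5, 260, 346)
```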
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
- Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
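A minimal sketch of a space-time-memory read of this kind is shown below, with the current frame querying keys and values built from adjacent frames; the dimensions, shared key projection, and single-head attention are assumptions rather than the paper's exact design.

```python
# Assumed STM-style memory read: current-frame features attend over
# key/value maps computed from adjacent (memory) frames.
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    def __init__(self, feat_dim=256, key_dim=64):
        super().__init__()
        self.to_key = nn.Conv2d(feat_dim, key_dim, 1)
        self.to_value = nn.Conv2d(feat_dim, feat_dim, 1)

    def forward(self, query_feat, memory_feats):
        # query_feat: (B, C, H, W); memory_feats: (B, T, C, H, W)
        B, T, C, H, W = memory_feats.shape
        q = self.to_key(query_feat).flatten(2)                   # (B, K, HW)
        mem = memory_feats.flatten(0, 1)                         # (B*T, C, H, W)
        k = self.to_key(mem).flatten(2).view(B, T, -1, H * W)
        v = self.to_value(mem).flatten(2).view(B, T, -1, H * W)
        k = k.permute(0, 2, 1, 3).flatten(2)                     # (B, K, T*HW)
        v = v.permute(0, 2, 1, 3).flatten(2)                     # (B, C, T*HW)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)      # (B, HW, T*HW)
        read = (v @ attn.transpose(1, 2)).view(B, C, H, W)
        return torch.cat([query_feat, read], dim=1)              # fuse by concat

stm = MemoryRead()
out = stm(torch.randn(1, 256, 32, 32), torch.randn(1, 3, 256, 32, 32))
print(out.shape)  # torch.Size([1, 512, 32, 32])
```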
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs crossmodal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
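One way to read the full-duplex idea is that both streams exchange features simultaneously, from the same input state, rather than sequentially; the mutual gating below is an assumed sketch, not FSNet's implementation.

```python
# Assumed full-duplex exchange: appearance and motion features are
# transmitted to each other at the same time, before fusion.
import torch
import torch.nn as nn

class FullDuplexExchange(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.a2m = nn.Conv2d(channels, channels, 1)  # appearance -> motion
        self.m2a = nn.Conv2d(channels, channels, 1)  # motion -> appearance

    def forward(self, appearance, motion):
        # Both directions are computed from the *input* features, so
        # transmission and receiving happen simultaneously, not in sequence.
        new_app = appearance + torch.sigmoid(self.m2a(motion)) * appearance
        new_mot = motion + torch.sigmoid(self.a2m(appearance)) * motion
        return new_app, new_mot

fx = FullDuplexExchange()
a, m = torch.randn(1, 256, 44, 44), torch.randn(1, 256, 44, 44)
out_a, out_m = fx(a, m)
print(out_a.shape, out_m.shape)
```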
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
- End-to-End Video Object Detection with Spatial-Temporal Transformers [33.40462554784311]
We present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture.
Our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring.
These designs boost the strong baseline deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID dataset.
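In the spirit of this design, the sketch below links per-frame object queries with a small temporal Transformer; the layer sizes and the choice to keep only the last frame's refined queries are assumptions, not TransVOD's code.

```python
# Assumed temporal query aggregation: object queries from several frames
# attend to one another, replacing post-processing such as Seq-NMS.
import torch
import torch.nn as nn

class TemporalQueryAggregator(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, queries_per_frame):
        # queries_per_frame: (B, T, N, C) output queries of T frames
        B, T, N, C = queries_per_frame.shape
        tokens = self.encoder(queries_per_frame.reshape(B, T * N, C))
        return tokens.view(B, T, N, C)[:, -1]  # refined current-frame queries

agg = TemporalQueryAggregator()
q = torch.randn(2, 4, 100, 256)  # 4 frames, 100 queries each
print(agg(q).shape)              # torch.Size([2, 100, 256])
```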
arXiv Detail & Related papers (2021-05-23T11:44:22Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
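One plausible reading of such an asymmetric block is a motion-derived soft attention that modulates appearance features, but not the reverse, as sketched below; this is an interpretation, not MATNet's code.

```python
# Assumed asymmetric, motion-attentive transition: motion features gate
# appearance features through a learned spatial attention map.
import torch
import torch.nn as nn

class MotionAttentiveTransition(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, appearance, motion):
        # appearance, motion: (B, C, H, W) features from the two-stream encoder
        attn = self.gate(motion)                 # motion-derived soft attention
        return appearance + appearance * attn    # residual, one-way transition

mat = MotionAttentiveTransition()
app, mot = torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64)
print(mat(app, mot).shape)  # torch.Size([1, 256, 64, 64])
```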
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
- Asynchronous Tracking-by-Detection on Adaptive Time Surfaces for Event-based Object Tracking [87.0297771292994]
We propose an Event-based Tracking-by-Detection (ETD) method for generic bounding box-based object tracking.
To achieve this goal, we present an Adaptive Time-Surface with Linear Time Decay (ATSLTD) event-to-frame conversion algorithm.
We compare the proposed ETD method with seven popular object tracking methods, based on either conventional or event cameras, and with two variants of ETD.
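A minimal sketch of a time surface with linear time decay, in the spirit of the ATSLTD conversion above, is given below; the fixed window length and sensor size are illustrative, and the paper's adaptive windowing is omitted.

```python
# Assumed time surface with linear time decay: pixel intensity falls
# linearly from 1 (newest event) to 0 (events older than the window).
import numpy as np

def linear_decay_time_surface(events, H=180, W=240, window=30_000):
    """events: (N, 3) array of [x, y, t_us], assumed sorted by time."""
    surface = np.zeros((H, W), dtype=np.float32)
    t_end = events[-1, 2]
    recent = events[events[:, 2] >= t_end - window]
    x, y = recent[:, 0].astype(int), recent[:, 1].astype(int)
    decay = 1.0 - (t_end - recent[:, 2]) / window  # linear decay in [0, 1]
    surface[y, x] = decay  # newer events overwrite older ones per pixel
    return surface

ev = np.column_stack([np.random.randint(0, 240, 5_000),
                      np.random.randint(0, 180, 5_000),
                      np.sort(np.random.randint(0, 100_000, 5_000))])
ts = linear_decay_time_surface(ev)
print(ts.shape, float(ts.max()))  # (180, 240) and a value close to 1.0
```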
arXiv Detail & Related papers (2020-02-13T15:58:31Z)