Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for
Autonomous Driving
- URL: http://arxiv.org/abs/2205.14882v1
- Date: Mon, 30 May 2022 06:41:10 GMT
- Title: Time3D: End-to-End Joint Monocular 3D Object Detection and Tracking for
Autonomous Driving
- Authors: Peixuan Li, Jieyu Jin
- Abstract summary: We propose jointly training 3D detection and 3D tracking from only monocular videos in an end-to-end manner.
Time3D achieves 21.4% AMOTA, 13.6% AMOTP on the nuScenes 3D tracking benchmark, surpassing all published competitors.
- Score: 3.8073142980733
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While separately leveraging monocular 3D object detection and 2D multi-object
tracking can be straightforwardly applied to sequence images in a
frame-by-frame fashion, a stand-alone tracker cuts off the transmission of
uncertainty from the 3D detector to tracking and cannot pass tracking error
differentials back to the 3D detector. In this work, we propose jointly
training 3D detection and 3D tracking from only monocular videos in an
end-to-end manner. The key component is a novel spatial-temporal information
flow module that aggregates geometric and appearance features to predict robust
similarity scores across all objects in current and past frames. Specifically,
we leverage the attention mechanism of the transformer, in which self-attention
aggregates the spatial information in a specific frame, and cross-attention
exploits relation and affinities of all objects in the temporal domain of
sequence frames. The affinities are then supervised to estimate the trajectory
and guide the flow of information between corresponding 3D objects. In
addition, we propose a temporal-consistency loss that explicitly incorporates
3D target motion modeling into the learning, making the 3D trajectory smooth
in the world coordinate system.
Time3D achieves 21.4\% AMOTA, 13.6\% AMOTP on the nuScenes 3D tracking
benchmark, surpassing all published competitors, and running at 38 FPS, while
Time3D achieves 31.2\% mAP, 39.4\% NDS on the nuScenes 3D detection benchmark.
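The attention-based association described in the abstract (self-attention to aggregate spatial context within a frame, cross-attention-style affinities between current and past objects, and a smoothness objective on world-coordinate trajectories) can be sketched in plain NumPy. This is an illustrative toy under stated assumptions: single-head, unprojected attention and a generic second-difference smoothness penalty, not the paper's actual architecture or loss.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats):
    """Aggregate spatial context among objects within one frame via
    scaled dot-product self-attention (queries = keys = values;
    no learned projections in this toy)."""
    d = feats.shape[-1]
    scores = feats @ feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ feats

def cross_frame_affinity(curr, prev):
    """Cross-attention-style affinity: a similarity score between every
    object in the current frame and every object in the past frame.
    In Time3D such affinities are supervised to estimate the trajectory."""
    d = curr.shape[-1]
    scores = curr @ prev.T / np.sqrt(d)
    return softmax(scores, axis=-1)  # row i: matching distribution for current object i

def temporal_consistency_loss(traj):
    """Penalize jerky world-coordinate trajectories via second differences
    (a generic smoothness surrogate, not the paper's exact formulation)."""
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]
    return float((accel ** 2).mean())

rng = np.random.default_rng(0)
curr = self_attention(rng.normal(size=(4, 16)))  # 4 objects, 16-dim features
prev = self_attention(rng.normal(size=(3, 16)))  # 3 objects in the past frame
aff = cross_frame_affinity(curr, prev)           # (4, 3) association scores
print(aff.shape)
```

A constant-velocity trajectory has zero second difference, so the smoothness term vanishes exactly when motion is linear in world coordinates.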
Related papers
- TAPVid-3D: A Benchmark for Tracking Any Point in 3D [63.060421798990845]
We introduce a new benchmark, TAPVid-3D, for evaluating the task of Tracking Any Point in 3D.
This benchmark will serve as a guidepost to improve our ability to understand precise 3D motion and surface deformation from monocular video.
arXiv Detail & Related papers (2024-07-08T13:28:47Z)
- Delving into Motion-Aware Matching for Monocular 3D Object Tracking [81.68608983602581]
We find that the motion cue of objects along different time frames is critical in 3D multi-object tracking.
We propose MoMA-M3T, a framework that mainly consists of three motion-aware components.
We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate our MoMA-M3T achieves competitive performance against state-of-the-art methods.
arXiv Detail & Related papers (2023-08-22T17:53:58Z)
- A Lightweight and Detector-free 3D Single Object Tracker on Point Clouds [50.54083964183614]
It is non-trivial to perform accurate target-specific detection since the point cloud of objects in raw LiDAR scans is usually sparse and incomplete.
We propose DMT, a Detector-free Motion prediction based 3D Tracking network that totally removes the usage of complicated 3D detectors.
arXiv Detail & Related papers (2022-03-08T17:49:07Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- 3D Visual Tracking Framework with Deep Learning for Asteroid Exploration [22.808962211830675]
In this paper, we focus on an accurate and real-time method for 3D tracking.
A new large-scale 3D asteroid tracking dataset is presented, including binocular video sequences, depth maps, and point clouds of diverse asteroids.
We propose a deep-learning-based 3D tracking framework, named Track3D, which combines a 2D monocular tracker with a novel lightweight amodal axis-aligned bounding-box network, A3BoxNet.
arXiv Detail & Related papers (2021-11-21T04:14:45Z)
- Monocular Quasi-Dense 3D Object Tracking [99.51683944057191]
A reliable and accurate 3D tracking framework is essential for predicting future locations of surrounding objects and planning the observer's actions in numerous applications such as autonomous driving.
We propose a framework that can effectively associate moving objects over time and estimate their full 3D bounding box information from a sequence of 2D images captured on a moving platform.
arXiv Detail & Related papers (2021-03-12T15:30:02Z)
- Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation [8.854112907350624]
3D multi-object tracking plays a vital role in autonomous navigation.
Many approaches detect objects in 2D RGB sequences for tracking, which lacks reliability when localizing objects in 3D space.
We propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in the adjacent frames.
arXiv Detail & Related papers (2020-11-25T16:14:40Z)
- Tracking from Patterns: Learning Corresponding Patterns in Point Clouds for 3D Object Tracking [34.40019455462043]
We propose to learn 3D object correspondences from temporal point cloud data and infer the motion information from correspondence patterns.
Our method exceeds existing 3D tracking methods on both the KITTI and the larger-scale nuScenes datasets.
arXiv Detail & Related papers (2020-10-20T06:07:20Z)
- Kinematic 3D Object Detection in Monocular Video [123.7119180923524]
We propose a novel method for monocular video-based 3D object detection which carefully leverages kinematic motion to improve precision of 3D localization.
We achieve state-of-the-art performance on monocular 3D object detection and the Bird's Eye View tasks within the KITTI self-driving dataset.
arXiv Detail & Related papers (2020-07-19T01:15:12Z)
- DeepTracking-Net: 3D Tracking with Unsupervised Learning of Continuous Flow [12.690471276907445]
This paper deals with the problem of 3D tracking, i.e., to find dense correspondences in a sequence of time-varying 3D shapes.
We propose a novel unsupervised 3D shape framework named DeepTracking-Net, which uses deep neural networks (DNNs) as auxiliary functions.
In addition, we contribute a new synthetic 3D dataset, named SynMotions, to the 3D tracking and recognition community.
arXiv Detail & Related papers (2020-06-24T16:20:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.