Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP
Benchmark
- URL: http://arxiv.org/abs/2212.08914v1
- Date: Sat, 17 Dec 2022 16:32:15 GMT
- Title: Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP
Benchmark
- Authors: Xiaofeng Wang, Zheng Zhu, Yunpeng Zhang, Guan Huang, Yun Ye, Wenbo Xu,
Ziwei Chen, Xingang Wang
- Abstract summary: ASAP is the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving.
We propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images.
In the ASAP benchmark, comprehensive experiments reveal that model rankings change under different computational constraints.
- Score: 23.872360763782037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, vision-centric perception has flourished in various
autonomous driving tasks, including 3D detection, semantic map construction,
motion forecasting, and depth estimation. Nevertheless, the latency of
vision-centric approaches is too high for practical deployment (e.g., most
camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap
between ideal research and real-world applications, it is necessary to quantify
the trade-off between performance and efficiency. Traditionally,
autonomous-driving perception benchmarks perform offline evaluation,
neglecting the inference time delay. To mitigate this problem, we propose the
Autonomous-driving StreAming Perception (ASAP) benchmark, which is the first
benchmark to evaluate the online performance of vision-centric perception in
autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we
first propose an annotation-extending pipeline to generate high-frame-rate
labels for the 12Hz raw images. To reflect practical deployment, the
Streaming Perception Under constRained-computation (SPUR) evaluation protocol
is further constructed, where the 12Hz inputs are utilized for streaming
evaluation under different computational-resource constraints. In the
ASAP benchmark, comprehensive experiments reveal that model rankings change
under different constraints, suggesting that model latency and computation
budget should be treated as design choices when optimizing for practical
deployment. To facilitate further research, we establish baselines for
camera-based streaming 3D detection, which consistently improve streaming
performance across various hardware platforms. ASAP project page:
https://github.com/JeffWang987/ASAP.
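To make the benchmark's two ingredients concrete, the sketch below illustrates (a) one straightforward way to densify sparse 2Hz keyframe annotations toward 12Hz by linearly interpolating object centers between keyframes, and (b) the streaming-matching rule under which each 12Hz query timestamp is evaluated against the most recent prediction whose inference has already finished. Function names, timings, and the interpolation scheme are illustrative assumptions, not the official ASAP/SPUR implementation (see the project page above for the actual code).

```python
# Minimal sketch of the two ideas behind the ASAP benchmark (illustrative only).
from bisect import bisect_right

def densify_track(key_times, key_centers, target_times):
    """Linearly interpolate an object's 3D center from sparse keyframe
    annotations (e.g., 2Hz) to denser target timestamps (e.g., 12Hz).
    A simple stand-in for an annotation-extending step."""
    dense = []
    for t in target_times:
        j = min(bisect_right(key_times, t), len(key_times) - 1)
        i = max(j - 1, 0)
        if key_times[j] == key_times[i]:
            dense.append(list(key_centers[i]))
            continue
        w = (t - key_times[i]) / (key_times[j] - key_times[i])
        w = max(0.0, min(1.0, w))  # clamp: interpolate, never extrapolate
        dense.append([a + w * (b - a)
                      for a, b in zip(key_centers[i], key_centers[j])])
    return dense

def match_streaming_predictions(query_times, input_times, latencies):
    """For each query timestamp, return the index of the most recent prediction
    that has already finished (input time + runtime <= query time), or None if
    nothing is available yet. Assumes a single model running sequentially, so
    finish times are non-decreasing; slower models are matched to staler outputs."""
    finish_times = [s + l for s, l in zip(input_times, latencies)]
    matches = []
    for t in query_times:
        i = bisect_right(finish_times, t) - 1  # latest finished prediction
        matches.append(i if i >= 0 else None)
    return matches

# Toy example: 12Hz queries, inputs consumed at 6Hz, ~300ms per inference.
queries = [i / 12.0 for i in range(12)]
inputs = [i / 6.0 for i in range(6)]
print(match_streaming_predictions(queries, inputs, [0.3] * 6))
```

Under such a matching rule, a detector with higher offline accuracy but 300ms latency can rank below a faster, slightly less accurate one, which is exactly the ranking shift the SPUR protocol is designed to expose.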
Related papers
- DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving [27.14089002387224]
We present DAMO-StreamNet, an optimized framework for streaming perception.
The framework combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms.
Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% sAP at normal input size (600, 960) and 43.3% sAP at large input size (1200, 1920), without using extra data.
arXiv Detail & Related papers (2023-03-30T04:34:31Z)
- A Simple Framework for 3D Occupancy Estimation in Autonomous Driving [16.605853706182696]
We present a CNN-based framework designed to reveal several key factors for 3D occupancy estimation.
We also explore the relationship between 3D occupancy estimation and other related tasks, such as monocular depth estimation and 3D reconstruction.
arXiv Detail & Related papers (2023-03-17T15:57:14Z)
- OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception [73.05425657479704]
We propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark.
We extend the large-scale nuScenes dataset with dense semantic occupancy annotations.
Considering the complexity of surrounding occupancy perception, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction.
arXiv Detail & Related papers (2023-03-07T15:43:39Z)
- StreamYOLO: Real-time Object Detection for Streaming Perception [84.2559631820007]
We endow the models with the capacity to predict the future, significantly improving streaming perception results.
We consider driving scenes with objects moving at multiple velocities and propose Velocity-aware streaming AP (VsAP) to jointly evaluate accuracy.
Our simple method achieves the state-of-the-art performance on Argoverse-HD dataset and improves the sAP and VsAP by 4.7% and 8.2% respectively.
arXiv Detail & Related papers (2022-07-21T12:03:02Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Real Time Monocular Vehicle Velocity Estimation using Synthetic Data [78.85123603488664]
We look at the problem of estimating the velocity of road vehicles from a camera mounted on a moving car.
We propose a two-step approach where first an off-the-shelf tracker is used to extract vehicle bounding boxes and then a small neural network is used to regress the vehicle velocity.
arXiv Detail & Related papers (2021-09-16T13:10:27Z)
- Real-time Streaming Perception System for Autonomous Driving [2.6058660721533187]
We present a real-time streaming perception system, which is also the 2nd-place solution of the Streaming Perception Challenge.
Unlike traditional object detection challenges, which focus mainly on absolute performance, the streaming perception task requires a balance between accuracy and latency.
On the Argoverse-HD test set, our method achieves 33.2 streaming AP (34.6 streaming AP verified by the organizer) under the required hardware.
arXiv Detail & Related papers (2021-07-30T01:32:44Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant.
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
- Streaming Object Detection for 3-D Point Clouds [29.465873948076766]
LiDAR provides a prominent sensory modality that informs many existing perceptual systems.
The latency for perceptual systems based on point cloud data can be dominated by the amount of time for a complete rotational scan.
We show how operating on LiDAR data in its native streaming formulation offers several advantages for self-driving object detection.
arXiv Detail & Related papers (2020-05-04T21:55:15Z)