JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes
- URL: http://arxiv.org/abs/2207.07895v1
- Date: Sat, 16 Jul 2022 10:33:59 GMT
- Title: JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes
- Authors: Haimei Zhao, Jing Zhang, Sen Zhang, Dacheng Tao
- Abstract summary: JPerceiver can simultaneously estimate scale-aware depth and VO as well as BEV layout from a monocular video sequence.
It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO.
Experiments on Argoverse, nuScenes and KITTI show the superiority of JPerceiver over existing methods on all three tasks.
- Score: 75.20435924081585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene
layout estimation are three critical tasks for driving scene perception, which is
fundamental to motion planning and navigation in autonomous driving. Although these
tasks are complementary, prior works usually focus on each task individually and
rarely address all three together. A naive approach is to accomplish them
independently, sequentially or in parallel, but this has two key drawbacks: 1) the
depth and VO results suffer from the inherent scale ambiguity of monocular input;
2) the BEV layout is predicted directly from the front-view image without any
depth-related information, even though the depth map contains useful geometric
clues for inferring scene layout. In this paper, we address these issues with a
novel joint perception framework named JPerceiver, which simultaneously estimates
scale-aware depth, VO, and BEV layout from a monocular video sequence. It exploits
a cross-view geometric transformation (CGT) to propagate the absolute scale from
the road layout to depth and VO through a carefully designed scale loss (sketched
below). Meanwhile, a cross-view and cross-modal transfer (CCT) module leverages
depth clues for reasoning about road and vehicle layout through an attention
mechanism (also sketched below). JPerceiver can be trained end-to-end in a
multi-task learning manner, where the CGT scale loss and the CCT module promote
inter-task knowledge transfer and benefit feature learning for each task.
Experiments on Argoverse, nuScenes and KITTI show the superiority of JPerceiver
over existing methods on all three tasks in terms of accuracy, model size, and
inference speed. The code and models are available at
https://github.com/sunnyHelen/JPerceiver.
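A minimal sketch of how a CGT-style scale loss could anchor absolute scale, assuming a pinhole camera and a known camera height above the ground plane (the abstract does not give the exact formulation; the function name, signature, and `cam_height` default are illustrative assumptions, not the authors' code):

```python
# Hypothetical sketch of a CGT-style scale loss (PyTorch), not the authors' code.
# Road pixels lie on the ground plane, whose metric extent is known from the
# BEV layout, so their depth has a closed form that carries absolute scale.
import torch

def cgt_scale_loss(pred_depth, road_mask, fy, cy, cam_height=1.65):
    """pred_depth: (B,1,H,W) predicted depth; road_mask: (B,1,H,W), 1 on road;
    fy, cy: vertical focal length and principal point (pixels);
    cam_height: camera height above the ground in metres (assumed value)."""
    B, _, H, W = pred_depth.shape
    v = torch.arange(H, dtype=pred_depth.dtype, device=pred_depth.device)
    v = v.view(1, 1, H, 1).expand(B, 1, H, W)
    # A ray through pixel row v meets the ground plane at depth
    # z = fy * h / (v - cy); this is only valid below the horizon (v > cy).
    cgt_depth = fy * cam_height / (v - cy).clamp(min=1e-3)
    valid = (road_mask > 0.5) & (v > cy + 1.0)
    # L1 penalty pulls predicted road depth towards the metrically scaled
    # ground-plane depth, propagating absolute scale to depth (and to VO,
    # which is trained jointly with it).
    diff = (pred_depth - cgt_depth).abs() * valid
    return diff.sum() / valid.sum().clamp(min=1)
```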
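Likewise, a minimal sketch of a CCT-style attention module, assuming BEV layout features attend to front-view depth features (the class name, token shapes, and use of `nn.MultiheadAttention` are assumptions; the paper's module may differ):

```python
# Hypothetical sketch of a CCT-style cross-view, cross-modal attention block
# (PyTorch), not the authors' code: layout queries gather geometric cues
# from the depth branch.
import torch.nn as nn

class CCTAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, layout_feat, depth_feat):
        """layout_feat: (B, Hb*Wb, C) BEV layout tokens (queries);
        depth_feat: (B, Hf*Wf, C) front-view depth tokens (keys/values)."""
        # Cross-attention injects depth clues into the layout features; the
        # residual connection preserves the original layout information.
        attended, _ = self.attn(layout_feat, depth_feat, depth_feat)
        return self.norm(layout_feat + attended)
```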
Related papers
- Scene as Occupancy [66.43673774733307]
OccNet is a vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy.
We propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes.
arXiv Detail & Related papers (2023-06-05T13:01:38Z)
- Object Semantics Give Us the Depth We Need: Multi-task Approach to Aerial Depth Completion [1.2239546747355885]
We propose a novel approach to jointly execute the two tasks in a single pass.
The proposed method is based on an encoder-focused multi-task learning model that exposes the two tasks to jointly learned features.
Experimental results show that the proposed multi-task network outperforms its single-task counterpart.
arXiv Detail & Related papers (2023-04-25T03:21:32Z)
- Graph-based Topology Reasoning for Driving Scenes [102.35885039110057]
We present TopoNet, the first end-to-end framework capable of abstracting traffic knowledge beyond conventional perception tasks.
We evaluate TopoNet on the challenging scene understanding benchmark OpenLane-V2.
arXiv Detail & Related papers (2023-04-11T15:23:29Z)
- Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation [117.26891277593205]
We focus on navigation and address the problem that existing navigation algorithms lack experience and common sense.
Inspired by the human ability to think twice before moving and to conceive several feasible paths towards a goal in unfamiliar scenes, we present a route planning method named the Path Estimation and Memory Recalling (PEMR) framework.
We show strong experimental results of PEMR on the EmbodiedQA navigation task.
arXiv Detail & Related papers (2021-10-16T13:30:55Z)
- Structured Bird's-Eye-View Traffic Scene Understanding from Onboard Images [128.881857704338]
We study the problem of extracting a directed graph representing the local road network in BEV coordinates from a single onboard camera image.
We show that the method can be extended to detect dynamic objects on the BEV plane.
We validate our approach against powerful baselines and show that our network achieves superior performance.
arXiv Detail & Related papers (2021-10-05T12:40:33Z)
- A Simple and Efficient Multi-task Network for 3D Object Detection and Road Understanding [20.878931360708343]
We show that it is possible to perform all perception tasks via a simple and efficient multi-task network.
Our proposed network, LidarMTL, takes a raw LiDAR point cloud as input and predicts six perception outputs for 3D object detection and road understanding.
arXiv Detail & Related papers (2021-03-06T08:00:26Z)
- Depth Based Semantic Scene Completion with Position Importance Aware Loss [52.06051681324545]
PALNet is a novel hybrid network for semantic scene completion.
It extracts both 2D and 3D features at multiple stages using fine-grained depth information.
It is beneficial for recovering key details like the boundaries of objects and the corners of the scene.
arXiv Detail & Related papers (2020-01-29T07:05:52Z)