Fugu-MT Paper Translation (Summary): JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes

Paper summary: JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes

arxiv url: http://arxiv.org/abs/2207.07895v1
Date: Sat, 16 Jul 2022 10:33:59 GMT
Status: Translation complete
Last updated in system: 2022-07-19 16:44:36.632094
Title: JPerceiver: Joint Perception Network for Depth, Pose and Layout Estimation in Driving Scenes
Title (reference translation): JPerceiver: A Joint Perception Network for Depth, Pose, and Layout Estimation in Driving Scenes
Authors: Haimei Zhao, Jing Zhang, Sen Zhang, Dacheng Tao
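Abstract summary: JPerceiver can simultaneously estimate scale-aware depth, VO, and BEV layout from a monocular video sequence. It uses a cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO. Experiments on Argoverse, nuScenes, and KITTI show that JPerceiver outperforms existing methods on all three tasks.
Reference score (independently computed attention score): 75.20435924081585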
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation are three critical tasks for driving scene perception, which is fundamental for motion planning and navigation in autonomous driving. Though they are complementary to each other, prior works usually focus on each individual task and rarely deal with all three tasks together. A naive way is to accomplish them independently in a sequential or parallel manner, but there are many drawbacks, i.e., 1) the depth and VO results suffer from the inherent scale ambiguity issue; 2) the BEV layout is directly predicted from the front-view image without using any depth-related information, although the depth map contains useful geometry clues for inferring scene layouts. In this paper, we address these issues by proposing a novel joint perception framework named JPerceiver, which can simultaneously estimate scale-aware depth and VO as well as BEV layout from a monocular video sequence. It exploits the cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO based on a carefully designed scale loss. Meanwhile, a cross-view and cross-modal transfer (CCT) module is devised to leverage the depth clues for reasoning about road and vehicle layout through an attention mechanism. JPerceiver can be trained in an end-to-end multi-task learning way, where the CGT scale loss and CCT module promote inter-task knowledge transfer to benefit feature learning of each task. Experiments on Argoverse, nuScenes and KITTI show the superiority of JPerceiver over existing methods on all the above three tasks in terms of accuracy, model size, and inference speed. The code and models are available at https://github.com/sunnyHelen/JPerceiver.
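Abstract (reference translation): Depth estimation, visual odometry (VO), and bird's-eye-view (BEV) scene layout estimation are three important tasks that underpin motion planning and navigation in autonomous driving. Although they complement one another, prior work usually concentrates on an individual task and rarely handles all three together. A naive approach is to accomplish them independently, either sequentially or in parallel, but this has many drawbacks: 1) the depth and VO results suffer from inherent scale ambiguity; 2) the BEV layout is predicted directly from the front-view image without any depth-related information, even though the depth map contains geometric clues useful for inferring scene layout. This paper proposes a new joint perception framework named JPerceiver, which simultaneously estimates scale-aware depth, VO, and BEV layout from a monocular video sequence. It uses a cross-view geometric transformation (CGT) to propagate the absolute scale from the road layout to depth and VO, based on a carefully designed scale loss. In addition, a cross-view and cross-modal transfer (CCT) module is developed to exploit depth clues for reasoning about road and vehicle layout through an attention mechanism. JPerceiver can be trained with end-to-end multi-task learning, in which the CGT scale loss and the CCT module promote knowledge transfer between tasks and benefit the feature learning of each task. Experiments on Argoverse, nuScenes, and KITTI show that JPerceiver outperforms existing methods on all three tasks in terms of accuracy, model size, and inference speed. The code and models are available at https://github.com/sunnyHelen/JPerceiver.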
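
The scale ambiguity mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the JPerceiver code: the function names `project` and `reproject`, the unit focal length, and the pure-translation camera motion are all assumptions made for illustration. It shows that multiplying the depth and the camera translation by the same factor leaves the reprojected pixel unchanged, which is why image evidence alone cannot fix the absolute scale of monocular depth and VO.

```python
def project(point, focal=1.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to image coordinates."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def reproject(pixel, depth, translation, focal=1.0):
    """Back-project a pixel using its depth, shift it into a second camera
    frame (pure translation, rotation omitted for simplicity), and project again."""
    u, v = pixel
    # Back-projection: pixel plus depth gives a 3D point in the first camera frame.
    point = (u * depth / focal, v * depth / focal, depth)
    # Express the point relative to the second camera position.
    moved = tuple(p - t for p, t in zip(point, translation))
    return project(moved, focal)


# Hypothetical values chosen only for the demonstration.
pixel = (0.2, -0.1)
depth = 10.0
translation = (0.5, 0.0, 1.0)

base = reproject(pixel, depth, translation)
for s in (0.5, 2.0, 7.0):
    # Scale depth and translation by the same unknown factor s.
    scaled = reproject(pixel, s * depth, tuple(s * t for t in translation))
    # The reprojected pixel is identical, so s is unobservable from images alone.
    assert all(abs(a - b) < 1e-9 for a, b in zip(base, scaled))
```

An external metric cue is needed to resolve the factor `s`; according to the abstract, JPerceiver obtains that cue from the road layout through the CGT scale loss.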

Related papers