Fugu-MT Paper Translation (Abstract): Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection

Paper abstract: Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection

arxiv url: http://arxiv.org/abs/2603.07486v1
Date: Sun, 08 Mar 2026 06:10:25 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:14.627661
Title: Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection
Title (reference translation): Multi-Modal Decoupling and Recoupling Network for Robust 3D Object Detection
Authors: Rui Ding, Zhaonian Kuang, Yuzhe Ji, Meng Yang, Xinhu Zheng, Gang Hua
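Abstract summary: We propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.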
Reference score (independently computed attention score): 20.541042952048862
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal 3D object detection with bird's eye view (BEV) has achieved notable advances on benchmarks. Nonetheless, the accuracy may drop significantly in the real world due to data corruption, such as that caused by sensor configurations for LiDAR and scene conditions for the camera. One design bottleneck of previous models resides in the tight coupling of multi-modal BEV features during fusion, which may degrade the overall system performance if one or both modalities are corrupted. To mitigate this, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features across modalities do not always fail simultaneously, because different types of data corruption affect each modality in distinct ways. These invariant features can be recovered across modalities for robust fusion under data corruption. To this end, we explicitly decouple Camera/LiDAR BEV features into modality-invariant and modality-specific parts. This allows invariant features to compensate for each other while mitigating the negative impact of a corrupted modality on the other. We then recouple these features into three experts that handle different types of data corruption, i.e., corruption of LiDAR, of the camera, and of both. For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement. Finally, we adaptively fuse the three experts to extract robust features for 3D object detection. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both, based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.
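Abstract (reference translation): Multi-modal 3D object detection using bird's eye view (BEV) has made notable progress on benchmarks. Nevertheless, accuracy may drop significantly in the real world due to data corruption, such as that caused by LiDAR sensor configurations and camera scene conditions. One design bottleneck of previous models lies in the tight coupling of multi-modal BEV features during fusion, which may degrade overall system performance if one or both modalities are corrupted. We therefore propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features do not always fail simultaneously, because different types of data corruption affect each modality differently, and that they can be recovered across modalities for robust fusion under data corruption. These features are then recoupled into three experts, namely LiDAR, camera, and both, each handling a different type of data corruption. Each expert uses modality-invariant features as robust information, while modality-specific features serve as a complement. For validation, we collect a benchmark with a large quantity of data corruption for LiDAR, camera, and both, based on nuScenes. Our model is trained on clean nuScenes and tested on all types of data corruption. Our model consistently achieves the best accuracy on both corrupted and clean data compared to recent models.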
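
The decouple, recouple, and fuse steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every function name here (`decouple`, `recouple`, `fuse`), the channel-halving split, the fixed 0.5 mixing weights, the choice of specific features for each expert, and the softmax gate are assumptions chosen only to make the data flow concrete. A real model would presumably use learned networks over full BEV feature maps rather than short Python lists.

```python
import math


def decouple(feat):
    """Split one feature vector into (invariant, specific) halves.

    ASSUMPTION: halving the channels is a stand-in for whatever learned
    decoupling the paper uses; the abstract does not specify it.
    """
    half = len(feat) // 2
    return feat[:half], feat[half:]


def mix(a, b, w):
    """Element-wise convex combination w*a + (1-w)*b."""
    return [w * x + (1.0 - w) * y for x, y in zip(a, b)]


def recouple(cam, lidar):
    """Build three expert features, one per corruption case.

    ASSUMPTION: which specific features each expert receives is our guess.
    """
    cam_inv, cam_spec = decouple(cam)
    lid_inv, lid_spec = decouple(lidar)
    # Invariant parts from both modalities compensate for each other.
    shared = mix(cam_inv, lid_inv, 0.5)
    return {
        # LiDAR corrupted: invariant features plus camera-specific complement.
        "lidar_corrupt": shared + cam_spec,
        # Camera corrupted: invariant features plus LiDAR-specific complement.
        "camera_corrupt": shared + lid_spec,
        # Both corrupted: invariant features plus averaged specific features.
        "both_corrupt": shared + mix(cam_spec, lid_spec, 0.5),
    }


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(experts, gate_scores):
    """Adaptive fusion as a softmax-weighted sum of expert features.

    ASSUMPTION: gate_scores would come from a learned gating network;
    here they are supplied by the caller.
    """
    names = list(experts)
    weights = softmax([gate_scores[n] for n in names])
    dim = len(experts[names[0]])
    return [
        sum(w * experts[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]


# Toy 4-channel features for a single BEV cell.
cam_feat = [1.0, 2.0, 3.0, 4.0]
lidar_feat = [3.0, 4.0, 5.0, 6.0]
experts = recouple(cam_feat, lidar_feat)
# A large score for "lidar_corrupt" makes the fused feature lean on that expert.
fused = fuse(experts, {"lidar_corrupt": 5.0, "camera_corrupt": 0.0, "both_corrupt": 0.0})
```

In this sketch the gate decides how much each expert contributes, so a gate that detects LiDAR corruption would shift weight toward the expert that complements the invariant features with camera-specific ones.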


Related papers list