Fugu-MT Paper Translation (Abstract): CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras

Paper overview: CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras

arxiv url: http://arxiv.org/abs/2604.17024v1
Date: Sat, 18 Apr 2026 15:14:11 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.295593
Title: CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
Title (reference translation): CAM3DNet: Comprehensively mining multi-scale features for 3D object detection using multi-view cameras
Authors: Mingxi Pang, Dingheng Wang, Zekun Li, Zhenping Sun, Bo Wang, Zhihang Wang, Zhao-Xu Yang
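Abstract summary: CAM3DNet is a novel query-based framework that combines composite query (CQ), adaptive self-attention (ASA), and multi-scale hybrid sampling (MSHS).
Reference score (independently computed attention measure): 6.46812874971512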
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Query-based 3D object detection methods using multi-view images often struggle to leverage dynamic multi-scale information efficiently: for example, the relationship between object features and the geometry of the queries is not sufficiently learned, and directly exploring multi-scale spatiotemporal features is too costly. To address these challenges, we propose CAM3DNet, a novel sparse query-based framework that combines three new modules: composite query (CQ), adaptive self-attention (ASA), and multi-scale hybrid sampling (MSHS). First, the core idea of the CQ module is a multi-scale projection strategy that transforms 2D queries into 3D space. Second, the ASA module learns the interactions between spatiotemporal multi-scale queries. Third, the MSHS module uses the deformable attention mechanism to sample multi-scale object information by considering multi-scale queries, pyramid feature maps, and 2D-camera prior knowledge. The entire model employs a backbone network and a feature pyramid network (FPN) as the encoder, introduces a YOLOX and a DepthNet as an ROI_Head to produce the CQ, and repeatedly applies ASA and MSHS as the decoder to obtain detection features. Extensive experiments on the nuScenes, Waymo, and Argoverse benchmark datasets demonstrate the effectiveness of CAM3DNet, which outperforms most existing camera-based 3D object detection methods. In addition, we conduct comprehensive ablation studies to examine the individual effects of CQ, ASA, and MSHS, as well as their space and computational costs.
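Abstract (reference translation): Query-based 3D object detection methods that use multi-view images often struggle to make efficient use of dynamic multi-scale information (for example, the relationship between object features and the geometry of the queries). CAM3DNet is a novel sparse query-based framework that combines three new modules: composite query (CQ), adaptive self-attention (ASA), and multi-scale hybrid sampling (MSHS). First, the core idea of the CQ module is a multi-scale projection strategy that transforms 2D queries into 3D space. Second, the ASA module learns the interactions between spatiotemporal multi-scale queries. Third, the MSHS module uses the deformable attention mechanism to sample multi-scale object information, taking into account multi-scale queries, pyramid feature maps, and 2D-camera prior knowledge. The overall model uses a backbone network and a feature pyramid network (FPN) as the encoder, then introduces YOLOX and DepthNet as an ROI_Head to generate the CQ, and repeatedly uses ASA and MSHS as the decoder to obtain detection features. Extensive experiments on the nuScenes, Waymo, and Argoverse benchmark datasets demonstrate the effectiveness of CAM3DNet, which outperforms most existing camera-based 3D object detection methods. In addition, the individual effects of CQ, ASA, and MSHS, together with their space and computational costs, are examined comprehensively.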
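
Note: The abstract describes the CQ module as a multi-scale projection that transforms 2D queries into 3D space, but this page does not give the procedure. The following is therefore only a minimal sketch of the generic pinhole back-projection that such a step could build on, not the authors' method. The names `Intrinsics`, `lift_to_3d`, and `rescale`, and all numeric values, are hypothetical. It assumes a 2D detector supplies a box centre in pixels and a depth estimator supplies a depth in metres, and it ignores half-pixel offsets when changing scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def lift_to_3d(u: float, v: float, depth: float, k: Intrinsics):
    """Back-project pixel (u, v) at the given depth into camera coordinates."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


def rescale(k: Intrinsics, stride: float) -> Intrinsics:
    """Intrinsics for a feature map downsampled by `stride`
    (assumption: pixel coordinates simply divide by the stride)."""
    return Intrinsics(k.fx / stride, k.fy / stride, k.cx / stride, k.cy / stride)


# Illustrative values only: a 1600x900 image and a box centre at (900, 450).
k_full = Intrinsics(fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
point_full = lift_to_3d(900.0, 450.0, 20.0, k_full)

# The same centre expressed on a stride-8 feature map lifts to the same 3D point.
k_s8 = rescale(k_full, 8.0)
point_s8 = lift_to_3d(900.0 / 8.0, 450.0 / 8.0, 20.0, k_s8)
```

Because pixel coordinates and intrinsics scale by the same factor, the lifted point is the same at every pyramid level, so a 2D query taken from any scale can be placed in a common 3D frame.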

List of related papers