Fugu-MT Paper Translation (Abstract): Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

Paper overview: Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

arxiv url: http://arxiv.org/abs/2604.12918v1
Date: Tue, 14 Apr 2026 16:00:53 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.552504
Title: Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation
Title (reference translation): Radar-Camera BEV Multi-Task Learning Using a Cross-Task Attention Bridge for Joint 3D Detection and Segmentation
Authors: Ahmet İnanç, Özgür Erkent
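Abstract summary: CTAB (Cross-Task Attention Bridge) is a module that exchanges features between the detection and segmentation branches. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), the joint multi-task model reaches comparable mIoU while simultaneously providing 3D detection.
Reference score (independently computed attention score): 0.6187780920448871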
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features provide dense semantic context that can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model reaches comparable mIoU on 4 classes while simultaneously providing 3D detection.
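Abstract (reference translation): Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, offering a unified spatial canvas in which detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods handle these tasks separately and miss the chance to share complementary information between them: detection features encode object-level geometry that can sharpen segmentation boundaries, while segmentation features supply dense semantic context that can anchor detection. This paper proposes CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between the detection and segmentation branches through multi-scale deformable attention in a shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling, yielding a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline while detection stays essentially neutral. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), the joint multi-task model reaches comparable mIoU on the 4 classes while simultaneously providing 3D detection.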
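The abstract reports segmentation quality as mIoU over BEV classes. As a rough illustration only, the sketch below shows one common way to compute mean IoU from per-class binary BEV masks. The function names, the class keys, and the toy 2x3 grids are hypothetical and chosen for this example; this is not the paper's evaluation code, and the actual protocol (thresholds, grid resolution, class handling) may differ.

```python
def iou(pred, target):
    """IoU between two binary masks given as equally sized 2D lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # None signals that the class is absent from both masks.
    return inter / union if union else None


def mean_iou(pred_masks, target_masks):
    """Average per-class IoU, skipping classes absent from both masks."""
    scores = {}
    for name in target_masks:
        score = iou(pred_masks[name], target_masks[name])
        if score is not None:
            scores[name] = score
    if not scores:
        return None, scores
    return sum(scores.values()) / len(scores), scores


# Toy 2x3 BEV grids (hypothetical values, not taken from the paper).
target = {
    "drivable_area": [[1, 1, 0], [1, 1, 0]],
    "vehicle":       [[0, 1, 0], [0, 0, 0]],
}
pred = {
    "drivable_area": [[1, 1, 1], [1, 0, 0]],
    "vehicle":       [[0, 1, 0], [0, 1, 0]],
}

miou, per_class = mean_iou(pred, target)
print(per_class, miou)
```

In this toy case the drivable-area masks share 3 of 5 occupied cells and the vehicle masks share 1 of 2, so the per-class scores are 0.6 and 0.5. Using one binary mask per class, rather than a single label map, lets overlapping classes such as drivable area and pedestrian crossing occupy the same cell.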

Related papers