Fugu-MT paper translation (abstract): UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

Paper overview: UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

arxiv url: http://arxiv.org/abs/2605.07356v1
Date: Fri, 08 May 2026 07:09:08 GMT
Status: translation complete
Last updated in system: 2026-05-11 19:43:38.878011
Title: UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition
Title (reference translation): UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Shared-Private Multimodal Decomposition
Authors: Shuai Zhang, Zhecheng Shi, Zhuxiao Li, Jing Ou, Tengxi Wang, Yuan Liu, Wufan Zhao
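Abstract summary: This paper proposes a unified framework for 2D and 3D semantic segmentation. We combine a SAM-based vision encoder with an SPTNet-based geometric encoder to extract complementary semantic and geometric representations. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation.
Reference score (independently computed attention score): 9.578297917595377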
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with a SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.
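Abstract (reference translation): Semantic segmentation of large-scale 3D point clouds is essential for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share common semantics. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with an SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The features obtained from both modalities are explicitly decomposed into a shared subspace and private subspaces: the shared components summarize semantic factors common to both domains, and the private components preserve properties specific to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks show consistent improvements in segmentation accuracy over representative multimodal baselines, together with competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shift, indicating strong generalization. The implementation code is publicly available at https://github.com/shuaizhang69/UniD-Shift.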
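Illustrative sketch (not from the paper): the abstract describes decomposing each modality's features into shared and private parts, fusing the shared parts with attention, and regularizing for alignment and subspace independence. The NumPy code below is a minimal toy version of that idea under stated assumptions. The linear projections, the squared-distance alignment term, the cross-correlation orthogonality penalty, the mean-query attention, and every function and variable name here are hypothetical choices made for illustration; the abstract does not specify any of them, and the authors' actual implementation may differ substantially.

```python
import numpy as np


def decompose(features, w_shared, w_private):
    """Project (N, D) features into shared and private parts with linear maps (assumed)."""
    return features @ w_shared, features @ w_private


def alignment_loss(shared_2d, shared_3d):
    """Mean squared distance between paired shared features (assumed alignment term)."""
    return float(np.mean(np.sum((shared_2d - shared_3d) ** 2, axis=1)))


def independence_loss(shared, private):
    """Squared Frobenius norm of shared^T private (assumed independence penalty)."""
    return float(np.sum((shared.T @ private) ** 2))


def attention_fuse(shared_2d, shared_3d):
    """Fuse two shared feature sets with a per-point softmax over the modalities."""
    stacked = np.stack([shared_2d, shared_3d], axis=1)   # (N, 2, K)
    query = stacked.mean(axis=1, keepdims=True)          # (N, 1, K), assumed query
    scores = (stacked * query).sum(axis=2) / np.sqrt(stacked.shape[2])  # (N, 2)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    fused = (weights[:, :, None] * stacked).sum(axis=1)  # (N, K)
    return fused, weights


# Toy usage with random stand-ins for image and point features of paired points.
rng = np.random.default_rng(0)
n_points, dim, k = 8, 16, 4
feat_2d = rng.normal(size=(n_points, dim))
feat_3d = rng.normal(size=(n_points, dim))
w_s2d, w_p2d = rng.normal(size=(dim, k)), rng.normal(size=(dim, k))
w_s3d, w_p3d = rng.normal(size=(dim, k)), rng.normal(size=(dim, k))

s2d, p2d = decompose(feat_2d, w_s2d, w_p2d)
s3d, p3d = decompose(feat_3d, w_s3d, w_p3d)
fused, weights = attention_fuse(s2d, s3d)

# Regularizer = alignment + weighted independence; the 0.1 weight is arbitrary.
regularizer = alignment_loss(s2d, s3d) + 0.1 * (
    independence_loss(s2d, p2d) + independence_loss(s3d, p3d)
)
print(fused.shape, regularizer)
```

In a trained model, a regularizer of this kind would be added to the segmentation loss, so that the shared parts of the two modalities are pulled together while each private part is pushed away from its shared counterpart.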

Related papers list