Fugu-MT Paper Translation (Abstract): xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

Paper overview: xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

arxiv url: http://arxiv.org/abs/2605.30111v1
Date: Thu, 28 May 2026 15:48:38 GMT
Status: translation complete
Last updated in system: 2026-05-30 02:45:56.450794
Title: xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR
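Authors: Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan
Abstract summary: This paper proposes xModel-KD, a cross-modal knowledge distillation framework for 3D point cloud segmentation. The method exploits the complementary strengths of 2D texture and 3D geometry to learn unified per-point representations. Experimental results show a 2% absolute improvement in mIoU over a LiDAR-only baseline.
Reference score (independently computed attention score): 1.0055428846517074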
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.
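The abstract describes a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations, but it does not give the loss formula. The following is a minimal sketch of one common choice, a symmetric InfoNCE loss over paired features, written in plain Python. The function names, the temperature value, and the assumption that row i of each input forms a matched 2D/3D pair are all illustrative and are not taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy where anchors[i] should match candidates[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        # Similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature
                  for c in candidates]
        # Numerically stable log-sum-exp.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def cross_modal_contrastive_loss(feats_2d, feats_3d, temperature=0.07):
    """Symmetric InfoNCE over paired features (assumed pairing: row i <-> row i)."""
    f2d = [l2_normalize(f) for f in feats_2d]
    f3d = [l2_normalize(f) for f in feats_3d]
    return 0.5 * (info_nce(f2d, f3d, temperature)
                  + info_nce(f3d, f2d, temperature))

# Toy features (hypothetical values): two points, each with a 2D and a 3D embedding.
aligned_2d = [[1.0, 0.0], [0.0, 1.0]]
aligned_3d = [[2.0, 0.1], [0.1, 3.0]]
shuffled_3d = [aligned_3d[1], aligned_3d[0]]  # deliberately mismatched pairs

loss_aligned = cross_modal_contrastive_loss(aligned_2d, aligned_3d)
loss_shuffled = cross_modal_contrastive_loss(aligned_2d, shuffled_3d)
```

In this toy setup the loss is small when each 2D feature points in the same direction as its paired 3D feature and grows when the pairs are mismatched, which is the behavior a feature-consistency objective relies on. A real pipeline would compute the same quantity on batched tensors from the image and point cloud backbones.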
