Fugu-MT Paper Translation (Summary): FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

Paper summary: FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

arxiv url: http://arxiv.org/abs/2510.08849v1
Date: Thu, 09 Oct 2025 22:43:26 GMT
Status: Translation complete
System update date: 2025-10-14 00:38:47.810348
Title: FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation
Title (reference translation): FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-Guided Knowledge Distillation
Authors: Hongrui Wu, Zhicheng Gao, Jin Cao, Kelu Yao, Wen Shen, Zhihua Wei
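Abstract summary: Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the label space. We propose a fast open-vocabulary 3D instance segmentation method via label-guided knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model.
Reference score (independently computed attention score): 12.531301406732203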
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs, which slows down inference. To address these problems, we propose FOLK, a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation. Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can classify instances directly from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating inference. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. We evaluate FOLK on the ScanNet200 and Replica datasets; it achieves state-of-the-art performance on ScanNet200 with an AP50 score of 35.7 while running approximately 6.0x to 152.2x faster than previous methods. All code will be released after the paper is accepted.
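Abstract (reference translation): Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods map 3D instances to 2D RGB-D images and classify them with vision-language models (VLMs). However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computation and memory costs during inference, which slows inference. To address these problems, we propose a fast open-vocabulary 3D instance segmentation method using label-guided knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, the distilled 3D model can classify instances directly from the 3D point cloud, avoiding noise caused by occlusion and significantly accelerating inference. Specifically, we first design a teacher model that generates a 2D CLIP embedding for each 3D instance, accounting for both visibility and viewpoint diversity, which serves as the learning target for distillation. Next, we develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm that distills open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK was evaluated on the ScanNet200 and Replica datasets and achieved state-of-the-art performance on ScanNet200 with an AP50 score of 35.7. All code will be released after the paper is accepted.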
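
The distillation step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not define "label-consistent", so the sketch assumes it means that the teacher's 2D embedding, when matched against text embeddings, predicts the annotated label. The cosine-distance loss, the function names (`classify`, `label_guided_distillation_loss`), and the two-dimensional toy vectors are all assumptions made for illustration; real CLIP embeddings have hundreds of dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def classify(embedding, text_embeddings):
    """Open-vocabulary classification: pick the label whose text
    embedding is most similar to the instance embedding."""
    return max(text_embeddings,
               key=lambda label: cosine(embedding, text_embeddings[label]))


def label_guided_distillation_loss(student, teacher, gt_labels, text_embeddings):
    """Assumed form of the loss: mean (1 - cosine similarity) between
    student and teacher embeddings, restricted to instances whose teacher
    embedding classifies to the annotated label. Other instances are skipped."""
    losses = []
    for s, t, y in zip(student, teacher, gt_labels):
        if classify(t, text_embeddings) == y:  # label-consistency filter (assumption)
            losses.append(1.0 - cosine(s, t))
    return sum(losses) / len(losses) if losses else 0.0


# Toy data: two classes, two instances. The second teacher embedding
# disagrees with its annotated label, so it is excluded from the loss.
text_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
teacher = [[0.9, 0.1], [0.8, 0.2]]
student = [[0.9, 0.1], [0.0, 1.0]]
gt_labels = ["chair", "table"]

loss = label_guided_distillation_loss(student, teacher, gt_labels, text_embeddings)
print(loss)
```

At inference time, under the same assumptions, only `classify` would be applied to the student's 3D embedding, which is why no 2D projection is needed at that stage.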

Related papers