Fugu-MT Paper Translation (Abstract): PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding

Paper overview: PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding

arxiv url: http://arxiv.org/abs/2601.02457v1
Date: Mon, 05 Jan 2026 18:55:45 GMT
Status: Translation complete
Last updated in system: 2026-01-07 17:02:12.683689
Title: PatchAlign3D: Local Feature Alignment for Dense 3D Shape understanding
Authors: Souhail Hadgi, Bingchen Gong, Ramana Sundararaman, Emery Pierson, Lei Li, Peter Wonka, Maks Ovsjanikov
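Abstract summary: Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but fall short on local part-level reasoning. This paper proposes an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. The 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference, without test-time multi-view rendering.
Reference score (independently computed attention score): 67.15800065888887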
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to directly solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language-model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes. We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective. Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference without any test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks. Project website: https://souhail-hadgi.github.io/patchalign3dsite/
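The abstract describes aligning patch embeddings with part-level text embeddings through a multi-positive contrastive objective, then using the aligned features for zero-shot part segmentation. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The exact loss form (mean negative log-softmax over each patch's positive texts), the temperature value, the `positives` format, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(patch, texts, temperature=0.07):
    """Temperature-scaled cosine similarity between one patch and every text embedding."""
    p = normalize(patch)
    return [sum(a * b for a, b in zip(p, normalize(t))) / temperature for t in texts]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def multi_positive_loss(patches, texts, positives, temperature=0.07):
    """Assumed loss form: for each patch, average the negative log-probability
    of all of its positive texts, then average over patches.

    positives[i] is the set of text indices describing patch i (hypothetical format).
    """
    total, count = 0.0, 0
    for patch, pos in zip(patches, positives):
        if not pos:
            continue  # patches without any matching text contribute nothing
        log_probs = log_softmax(cosine_logits(patch, texts, temperature))
        total += -sum(log_probs[j] for j in pos) / len(pos)
        count += 1
    return total / count if count else 0.0

def zero_shot_labels(patches, texts):
    """Assign each patch the index of its most similar text embedding."""
    labels = []
    for patch in patches:
        logits = cosine_logits(patch, texts)
        labels.append(max(range(len(texts)), key=lambda j: logits[j]))
    return labels

# Toy 2-D embeddings; real features would come from the trained 3D and text encoders.
patch_features = [[1.0, 0.1], [0.9, -0.1], [0.0, 1.0]]
text_features = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for two part prompts (hypothetical)
positive_sets = [{0}, {0}, {1}]

labels = zero_shot_labels(patch_features, text_features)
loss = multi_positive_loss(patch_features, text_features, positive_sets)
```

In this toy setup the first two patches are assigned to text 0 and the third to text 1, and the loss is close to zero because every patch already points toward its positive text. At inference, only the single encoder pass and the similarity lookup are needed, which matches the abstract's claim of segmentation without test-time multi-view rendering.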

Related papers