Fugu-MT Paper Translation (Abstract): LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

Paper overview: LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding

arxiv url: http://arxiv.org/abs/2604.01388v1
Date: Wed, 01 Apr 2026 20:48:06 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:09.971789
Title: LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
Title (reference translation): LESV: Language-Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
Authors: Fusang Wang, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Fabien Moutarde
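Abstract summary: This paper proposes a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. The proposed method achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks.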
Reference score (independently computed attention score): 9.377694035678948
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
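Abstract (reference translation): Recent advances in open-vocabulary 3D scene understanding rely heavily on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, these approaches have two limitations: spatial ambiguity arising from unstructured, overlapping Gaussians, which necessitates probabilistic feature registration, and multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we propose a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. Regularizing SVRaster with monocular depth and normal priors establishes a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the dense alignment properties of the foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. The proposed method achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks.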
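
The abstract does not give implementation details, so the following is only a minimal illustrative sketch of what confidence-weighted feature registration into disjoint voxels could look like. It is not the authors' method and does not use SVRaster or AM-RADIO. The input format `(point, feature, confidence)`, the uniform grid, and the names `voxel_key`, `fuse_features`, and `cosine` are all assumptions made for illustration. The point it shows is that, because voxels do not overlap, each observation is assigned to exactly one cell, so registration is a plain weighted average rather than a probabilistic blend over overlapping primitives.

```python
import math
from collections import defaultdict


def voxel_key(point, voxel_size):
    """Map a 3D point to the integer index of the single voxel containing it."""
    return tuple(math.floor(c / voxel_size) for c in point)


def fuse_features(observations, voxel_size=0.1):
    """Confidence-weighted average of features, one entry per occupied voxel.

    observations: iterable of (point, feature, confidence) tuples.
    This input format is hypothetical and chosen for illustration only.
    """
    sums = {}                      # voxel index -> weighted feature sum
    weights = defaultdict(float)   # voxel index -> total confidence
    for point, feature, conf in observations:
        if conf <= 0.0:
            continue               # ignore observations with no confidence
        key = voxel_key(point, voxel_size)
        acc = sums.setdefault(key, [0.0] * len(feature))
        for i, value in enumerate(feature):
            acc[i] += conf * value
        weights[key] += conf
    # Only occupied voxels are stored, which keeps the grid sparse.
    return {k: [s / weights[k] for s in acc] for k, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


# Two observations fall into voxel (0, 0, 0), one into voxel (2, 0, 0).
observations = [
    ((0.05, 0.05, 0.05), [1.0, 0.0], 3.0),
    ((0.06, 0.02, 0.01), [0.0, 1.0], 1.0),
    ((0.25, 0.00, 0.00), [0.0, 1.0], 2.0),
]
fused = fuse_features(observations, voxel_size=0.1)
```

An open-vocabulary query under these assumptions would then rank voxels by `cosine(fused[key], text_embedding)`, where `text_embedding` stands in for a language embedding living in the same feature space.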

