Fugu-MT Paper Translation (Abstract): 3D Primitives are a Spatial Language for VLMs

Paper summary: 3D Primitives are a Spatial Language for VLMs

arxiv url: http://arxiv.org/abs/2605.12586v1
Date: Tue, 12 May 2026 17:57:21 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.595666
Title: 3D Primitives are a Spatial Language for VLMs
Title (reference translation): 3D Primitives are a Spatial Language for VLMs
Authors: Junze Liu, Kun Qian, Florian Dubost, Kai Zhong, Arvind Srinivasan, Nan Chen, Anping Wang, Sam Zhang, Alejandro Mottini, Qingjun Cui, Tian Wang
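Abstract summary: Vision-language models can generate code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. The paper shows that 3D geometric primitives serve as a powerful intermediate representation for spatial understanding and exploits this through three contributions.
Reference score (independently computed attention measure): 45.036016381384336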
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) exhibit a striking paradox: they can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark evaluating fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes), revealing that a single model's object-detection F1 can vary by up to 5.7× across languages. Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT lifts the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which self-supervisedly distills primitive spatial knowledge into general visual reasoning by parsing the model's own Three.js primitive-reconstructions into structured annotations and fine-tuning on the result, with no human labels and no teacher model. Training on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench; the recipe transfers across model families. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.
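Abstract (reference translation): Vision-language models (VLMs) can generate executable code that reconstructs a 3D scene from geometric primitives with correct object counts, classes, and approximate positions, yet the same models fail at simpler spatial questions on the same image. We show that 3D geometric primitives (cubes, spheres, and cylinders, expressed in executable code) serve as a powerful intermediate representation for spatial understanding, and exploit this through three contributions. First, we introduce SpatialBabel, a benchmark that evaluates fourteen VLMs on primitive-based 3D scene reconstruction across six scene-code languages (programming languages and declarative formats for 3D primitive scenes). Second, we propose Code-CoT (Code Chain-of-Thought), a training-free inference strategy that routes spatial reasoning through primitive-based code generation. Code-CoT raises the SpatialBabel-QA-Score by up to +6.4% on primitive scenes and real-photo CV-Bench-3D accuracy by +5.0% for VLMs with strong coding capabilities. Third, we propose S³-FT (Self-Supervised Spatial Fine-Tuning), which distills primitive spatial knowledge into general visual reasoning in a self-supervised manner by parsing the model's own Three.js primitive reconstructions into structured annotations and fine-tuning on the result. Trained on primitive images alone, S³-FT improves Qwen3-VL-8B by +4.6 to +8.6% on SpatialBabel-Primitive-QA, +9.7% on CV-Bench-2D, and +17% on HallusionBench. These results establish geometric primitives in code as both a diagnostic and a transferable spatial vocabulary for VLMs. We will release all artifacts upon publication.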
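
To make the idea of primitives as an intermediate representation concrete, the following is a minimal Python sketch of how spatial questions could be answered from a primitive scene description instead of directly from pixels. It is not the paper's implementation: the `Primitive` schema, the coordinate convention (+x to the right), the camera position, and all helper names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Primitive:
    """One object in a primitive scene (hypothetical schema, not from the paper)."""
    name: str       # free-form label, e.g. "red_cube"
    shape: str      # "cube", "sphere", or "cylinder"
    position: Vec3  # (x, y, z); assumed convention: +x points right


def is_left_of(a: Primitive, b: Primitive) -> bool:
    """True if `a` has a smaller x coordinate than `b` (assumed +x = right)."""
    return a.position[0] < b.position[0]


def closest_to_camera(scene: List[Primitive], camera: Vec3) -> Primitive:
    """Return the primitive with the smallest Euclidean distance to the camera."""
    return min(scene, key=lambda p: math.dist(p.position, camera))


def count_by_shape(scene: List[Primitive]) -> Dict[str, int]:
    """Count how many primitives of each shape the scene contains."""
    return dict(Counter(p.shape for p in scene))


# Illustrative scene; the values are made up for this example.
scene = [
    Primitive("red_cube", "cube", (-1.0, 0.5, 0.0)),
    Primitive("blue_sphere", "sphere", (1.5, 0.5, -2.0)),
    Primitive("green_cylinder", "cylinder", (0.0, 0.5, 1.0)),
]
camera = (0.0, 0.5, 5.0)  # assumed camera position

print(is_left_of(scene[0], scene[1]))
print(closest_to_camera(scene, camera).name)
print(count_by_shape(scene))
```

Once a scene is available in a structured form like this, questions about counts, left/right relations, and relative depth reduce to simple arithmetic on coordinates, which is one plausible reading of why routing reasoning through primitive code can help.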

Related papers