Fugu-MT Paper Translation (Abstract): GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation

Paper overview: GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation

arxiv url: http://arxiv.org/abs/2605.28995v1
Date: Wed, 27 May 2026 18:53:09 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.240174
Title: GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation
Title (reference translation): GAP3D: Generative Alignment of VLM Latents to Patch-Level Embeddings for 3D Generation
Authors: Polytimi Anna Gkotsi, Andrii Zadaianchuk, Mohammad Mahdi Derakhshani,
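Abstract summary: GAP3D is a modular, diffusion-based approach that aligns VLM latents directly to the complete, patch-level feature space of a pre-trained image encoder. The method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input.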
Reference score (independently computed attention score): 9.608873992799511
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent approaches integrating vision-language models (VLMs) as prompt encoders for generative model conditioning typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure required for geometry-aware tasks like 3D asset generation. To address this, we propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete, patch-level feature space of a pre-trained image encoder, enabling a frozen downstream generative model to utilize a VLM as prompt encoder while maintaining a spatially structured conditioning signal. Evaluated on 3D asset generation, our method bypasses the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also exhibits emergent zero-shot capabilities for multimodal prompts, despite being trained exclusively on text input. Finally, while currently prioritizing high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, taking the first steps towards a modular integration of foundation models through generative alignment to dense embedding spaces.
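Abstract (reference translation): Recent approaches that integrate vision-language models (VLMs) as prompt encoders for conditioning generative models typically rely on expensive end-to-end training or map features to compressed representations, discarding the dense spatial structure needed for geometry-aware tasks such as 3D asset generation. We therefore propose GAP3D, a modular, diffusion-based approach that aligns VLM-generated latents directly to the complete patch-level feature space of a pre-trained image encoder, so that a VLM can serve as the prompt encoder while the conditioning signal stays spatially structured. The method is evaluated on 3D asset generation and avoids the need for large-scale 3D data by training mainly on general-domain image-text pairs. It also shows emergent zero-shot capabilities for multimodal prompts, even though it is trained only on text input. Finally, although it currently prioritizes high-level semantics over fine-grained detail, GAP3D demonstrates that the representation gap between VLM and image-encoder feature spaces can be partially bridged through diffusion-based alignment, a first step towards modular integration of foundation models through generative alignment to dense embedding spaces.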
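The abstract describes aligning VLM latents to patch-level image-encoder features with a diffusion model but gives no implementation details. The sketch below is only an illustration of what a generic noise-prediction training step over patch embeddings could look like. The linear noise schedule, the `toy_denoiser` stand-in, the tiny dimensions, and every function name are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(patches, alpha_bar, rng):
    """Forward diffusion per patch token: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    noise = [[rng.gauss(0.0, 1.0) for _ in row] for row in patches]
    noisy = [
        [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
         for x, e in zip(row, eps_row)]
        for row, eps_row in zip(patches, noise)
    ]
    return noisy, noise


def toy_denoiser(noisy, vlm_latent, weight):
    """Hypothetical stand-in for an alignment network: predicts per-patch noise
    from the noisy patch and one shared conditioning vector of the same size."""
    return [[weight * (x - c) for x, c in zip(row, vlm_latent)] for row in noisy]


def eps_mse(pred, noise):
    """Mean squared error between predicted and true noise over all patch entries."""
    total = sum((p - e) ** 2 for pr, nr in zip(pred, noise) for p, e in zip(pr, nr))
    count = sum(len(row) for row in noise)
    return total / count


rng = random.Random(0)
num_patches, dim = 4, 3  # toy sizes; real patch grids are far larger
# Stand-ins for image-encoder patch embeddings and a VLM latent (random here).
target_patches = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
vlm_latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]

alpha_bars = make_alpha_bars(100)
t = 50  # one sampled timestep
noisy, noise = add_noise(target_patches, alpha_bars[t], rng)
loss = eps_mse(toy_denoiser(noisy, vlm_latent, weight=0.5), noise)
print(f"noise-prediction loss at t={t}: {loss:.4f}")
```

The point of the sketch is that the diffusion target keeps one vector per patch, so the spatial layout of the embedding is preserved through training rather than being pooled into a single compressed vector.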


Related papers