Fugu-MT Paper Translation (Abstract): SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

Paper overview: SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

arxiv url: http://arxiv.org/abs/2509.15693v1
Date: Fri, 19 Sep 2025 07:13:45 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:11.044889
Title: SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions
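Authors: Cristian Sbrolli, Matteo Matteucci
Abstract summary: SceneForge is a framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. By augmenting contrastive training with structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets.
Reference score (attention score computed independently by this site): 9.41365281895669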
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
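
The composition idea in the abstract (placing individual shapes into a scene with an explicit spatial relation and pairing the result with a multi-object description) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the relation table, the `compose_pair` helper, the `gap` parameter, and the caption template are all hypothetical, and both the language-model refinement of captions and the mixing of composed samples into training batches are left out.

```python
import random

# Hypothetical relation table: relation phrase -> unit offset direction (x, y, z).
RELATIONS = {
    "to the right of": (1.0, 0.0, 0.0),
    "to the left of": (-1.0, 0.0, 0.0),
    "behind": (0.0, 1.0, 0.0),
    "on top of": (0.0, 0.0, 1.0),
}


def extent(points, axis):
    """Size of the axis-aligned bounding box along one axis."""
    values = [p[axis] for p in points]
    return max(values) - min(values)


def center(points):
    """Translate a point cloud so its bounding-box center sits at the origin."""
    mids = []
    for axis in range(3):
        values = [p[axis] for p in points]
        mids.append((min(values) + max(values)) / 2)
    return [tuple(p[a] - mids[a] for a in range(3)) for p in points]


def compose_pair(anchor, anchor_name, other, other_name, relation, gap=0.1):
    """Place `other` relative to `anchor`; return (scene_points, caption)."""
    direction = RELATIONS[relation]
    anchor = center(anchor)
    other = center(other)
    # Shift along the relation axis far enough that the boxes do not overlap.
    shift = []
    for axis in range(3):
        half_sizes = (extent(anchor, axis) + extent(other, axis)) / 2
        shift.append(direction[axis] * (half_sizes + gap))
    moved = [tuple(p[a] + shift[a] for a in range(3)) for p in other]
    # Templated caption; the paper refines descriptions with a language model.
    caption = f"a {other_name} {relation} a {anchor_name}"
    return anchor + moved, caption


def random_cloud(n, seed):
    """Stand-in for a real shape: n random points in the unit cube."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]


chair = random_cloud(64, seed=0)
lamp = random_cloud(64, seed=1)
scene, caption = compose_pair(chair, "chair", lamp, "lamp", "to the right of")
print(len(scene), caption)
```

In a contrastive setup, each composed `(scene, caption)` pair would be treated as one extra training sample alongside the single-object pairs. How many objects to place per scene and what share of each batch should be composed are among the design choices the abstract says the paper investigates.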

Related papers