SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions
- URL: http://arxiv.org/abs/2509.15693v2
- Date: Thu, 16 Oct 2025 13:14:57 GMT
- Title: SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions
- Authors: Cristian Sbrolli, Matteo Matteucci
- Abstract summary: SceneForge is a framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. By augmenting contrastive training with structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets.
- Score: 9.41365281895669
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The whole is greater than the sum of its parts, even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjectNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.
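The core idea described in the abstract can be sketched in a few lines: compose individual point clouds into a multi-object scene with a spatial relation, pair the scene with a composed caption, and train with a symmetric contrastive (InfoNCE-style) objective. This is a minimal illustrative sketch, not the paper's actual pipeline: the function names, the naive "next to" caption template (the paper uses LLM-refined descriptions), and the toy random point clouds are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def compose_scene(shapes, captions, spacing=2.0):
    """Place individual point clouds side by side along the x-axis and
    join their captions with a simple spatial-relation template.
    (Illustrative only; SceneForge's construction strategies and
    LLM-refined multi-object descriptions are more elaborate.)"""
    placed = []
    for i, pts in enumerate(shapes):
        offset = np.array([i * spacing, 0.0, 0.0])
        placed.append(pts + offset)  # translate each shape into its slot
    scene = np.concatenate(placed, axis=0)
    text = " next to ".join(captions)  # naive relation template
    return scene, text

def info_nce(z_pc, z_txt, temperature=0.07):
    """Symmetric InfoNCE loss over matched point-cloud/text embeddings:
    each cloud's positive is its own caption; all others are negatives."""
    z_pc = z_pc / np.linalg.norm(z_pc, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = z_pc @ z_txt.T / temperature
    labels = np.arange(len(z_pc))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)      # stabilize softmax
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()  # diagonal = positives
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy example: two random "shapes" composed into one scene/caption pair.
chair = rng.normal(size=(128, 3))
table = rng.normal(size=(128, 3))
scene, caption = compose_scene([chair, table], ["a chair", "a table"])
```

The composed (scene, caption) pairs would then be mixed into ordinary single-object contrastive batches; the abstract notes that the mixing proportion and the number of objects per scene are tuned design choices.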
Related papers
- Hierarchical Image-Guided 3D Point Cloud Segmentation in Industrial Scenes via Multi-View Bayesian Fusion [4.679314646805623]
3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. We propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance-level to part-level.
arXiv Detail & Related papers (2025-12-07T15:15:52Z) - MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation [16.539993197236125]
We present MetaFind, a scene-aware tri-modal compositional retrieval framework. It is designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories.
arXiv Detail & Related papers (2025-10-05T06:37:26Z) - ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition [34.39212457455039]
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. We propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-07-15T12:35:01Z) - TextSplat: Text-Guided Semantic Fusion for Generalizable Gaussian Splatting [46.753153357441505]
Generalizable Gaussian Splatting has enabled robust 3D reconstruction from sparse input views. We propose TextSplat, the first text-driven Generalizable Gaussian Splatting framework.
arXiv Detail & Related papers (2025-04-13T14:14:10Z) - CrossOver: 3D Scene Cross-Modal Alignment [78.3057713547313]
CrossOver is a novel framework for cross-modal 3D scene understanding. It learns a unified, modality-agnostic embedding space for scenes by aligning modalities. It supports robust scene retrieval and object localization, even with missing modalities.
arXiv Detail & Related papers (2025-02-20T20:05:30Z) - BloomScene: Lightweight Structured 3D Gaussian Splatting for Crossmodal Scene Generation [54.12899218104669]
3D scenes have highly complex structures and need to ensure that the output is dense, coherent, and contains all necessary structures. Current 3D scene generation methods rely on pre-trained text-to-image diffusion models and monocular depth estimators. We propose BloomScene, a lightweight structured 3D Gaussian splatting for crossmodal scene generation.
arXiv Detail & Related papers (2025-01-15T11:33:34Z) - SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z) - TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.