ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
- URL: http://arxiv.org/abs/2507.11261v2
- Date: Sun, 27 Jul 2025 06:20:54 GMT
- Title: ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
- Authors: Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He
- Abstract summary: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. We propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods.
- Score: 34.39212457455039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
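The abstract describes a three-stage pipeline: SRD decomposes a multi-anchor query into single-anchor, perspective-aware statements; Multi-TSI fuses text and scene features per viewpoint through shared Cross-modal Consistent View Tokens; and a reasoning module aggregates the multi-view predictions into one grounding result. The sketch below is a minimal PyTorch illustration of how such a pipeline could be wired together; all class names, tensor shapes, and interfaces are assumptions for exposition and are not taken from the authors' released code (see the linked repository for the actual implementation).

```python
# Minimal, illustrative sketch of the three-stage pipeline described in the
# abstract (SRD -> Multi-TSI with shared view tokens -> multi-view reasoning).
# All class names, shapes, and interfaces are assumptions for exposition; they
# are NOT taken from the released code at github.com/visualjason/ViewSRD.
import torch
import torch.nn as nn


class SimpleRelationDecoupling(nn.Module):
    """SRD, approximated at the embedding level: in the paper this stage rewrites
    a multi-anchor query into single-anchor statements; here we simply split one
    query embedding into S statement embeddings for illustration."""

    def __init__(self, dim: int, num_statements: int = 4):
        super().__init__()
        self.num_statements = num_statements
        self.decompose = nn.Linear(dim, dim * num_statements)

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        b, d = query_emb.shape
        return self.decompose(query_emb).view(b, self.num_statements, d)  # (B, S, D)


class MultiTSI(nn.Module):
    """Multi-view textual-scene interaction: each view conditions the decomposed
    statements on a shared, learnable view token (a stand-in for a CCVT) and lets
    the text attend to the scene's object features."""

    def __init__(self, dim: int, num_views: int = 4, num_heads: int = 8):
        super().__init__()
        self.view_tokens = nn.Parameter(torch.randn(num_views, dim))  # shared tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, statements: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        b = statements.size(0)
        per_view = []
        for v in range(self.view_tokens.size(0)):
            token = self.view_tokens[v].expand(b, 1, -1)      # (B, 1, D) view token
            query = torch.cat([token, statements], dim=1)     # view-conditioned text
            fused, _ = self.attn(query, scene, scene)         # text attends to scene objects
            per_view.append(fused.mean(dim=1))                # (B, D) summary per view
        return torch.stack(per_view, dim=1)                   # (B, V, D)


class TextualSceneReasoning(nn.Module):
    """Aggregates the per-view features and scores every candidate object."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, multi_view: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        pooled = multi_view.mean(dim=1)                               # fuse views -> (B, D)
        return torch.einsum("bd,bnd->bn", self.proj(pooled), scene)  # (B, N) object logits


if __name__ == "__main__":
    B, N, D = 2, 16, 256                      # batch, candidate objects, feature dim
    query = torch.randn(B, D)                 # encoded referring expression
    scene = torch.randn(B, N, D)              # encoded 3D object proposals
    srd, tsi, head = SimpleRelationDecoupling(D), MultiTSI(D), TextualSceneReasoning(D)
    logits = head(tsi(srd(query), scene), scene)
    print(logits.shape)                       # torch.Size([2, 16]); argmax = grounded object
```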
Related papers
- HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model [14.277165215664425]
Large vision-language models (VLMs) have shown significant promise for 3D scene understanding.
Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space.
We propose a novel hierarchical multimodal representation for 3D scene reasoning.
arXiv Detail & Related papers (2025-11-28T08:06:20Z)
- MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation [16.539993197236125]
We present MetaFind, a scene-aware tri-modal compositional retrieval framework.
It is designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories.
arXiv Detail & Related papers (2025-10-05T06:37:26Z)
- SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions [9.41365281895669]
SceneForge is a framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions.
By augmenting contrastive training with structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets.
arXiv Detail & Related papers (2025-09-19T07:13:45Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.
This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering [106.96097136553105]
3D Question Answering (3D QA) requires the model to understand the situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation.
Existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images.
We propose a Dual-vision Scene Perception Network (DSPNet) to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA.
arXiv Detail & Related papers (2025-03-05T05:13:53Z)
- CrossOver: 3D Scene Cross-Modal Alignment [78.3057713547313]
CrossOver is a novel framework for cross-modal 3D scene understanding.
It learns a unified, modality-agnostic embedding space for scenes by aligning modalities.
It supports robust scene retrieval and object localization, even with missing modalities.
arXiv Detail & Related papers (2025-02-20T20:05:30Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion [0.0]
We propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships.
We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels.
We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.
arXiv Detail & Related papers (2023-07-16T04:08:03Z)
- MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of von Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
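The MMRDN summary above mentions fitting view-projected embeddings with von Mises-Fisher (vMF) distributions. As a generic, standalone illustration (not that paper's code), the snippet below estimates a vMF mean direction and concentration from unit-normalized embeddings using the standard closed-form approximation of Banerjee et al. (2005); the function name and toy data are purely illustrative.

```python
# Generic sketch: fitting a von Mises-Fisher (vMF) distribution to unit-norm
# embeddings. This is NOT the MMRDN implementation; the concentration estimate
# is the standard closed-form approximation of Banerjee et al. (2005).
import numpy as np


def fit_vmf(embeddings: np.ndarray) -> tuple[np.ndarray, float]:
    """Estimate the vMF mean direction mu and concentration kappa from (N, D) vectors."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # project to sphere
    s = x.sum(axis=0)
    r_bar = np.linalg.norm(s) / x.shape[0]               # mean resultant length in [0, 1)
    mu = s / np.linalg.norm(s)                           # mean direction
    d = x.shape[1]
    kappa = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)    # approximate concentration
    return mu, float(kappa)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "embeddings" clustered around a random direction in a 64-dim space.
    center = rng.normal(size=64)
    samples = center + 0.1 * rng.normal(size=(256, 64))
    mu, kappa = fit_vmf(samples)
    print(mu.shape, round(kappa, 1))  # (64,) and a large kappa for this tight cluster
```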
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding [15.617150859765024]
We exploit the Transformer architecture for its natural suitability to permutation-invariant 3D point cloud data.
We propose a TransRefer3D network to extract entity-and-relation aware multimodal context.
Our proposed model significantly outperforms existing approaches by up to 10.6%.
arXiv Detail & Related papers (2021-08-05T05:47:12Z)
- Descriptor-Free Multi-View Region Matching for Instance-Wise 3D Reconstruction [34.21773285521006]
We propose a multi-view region matching method based on epipolar geometry.
We show that the epipolar region matching can be easily integrated into instance segmentation and is effective for instance-wise 3D reconstruction.
arXiv Detail & Related papers (2020-11-27T10:45:18Z)
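The last entry describes multi-view region matching based on epipolar geometry. As a toy, self-contained illustration (not the cited paper's implementation), the snippet below scores a candidate match in a second view by its distance to the epipolar line induced by a known fundamental matrix; the rectified-stereo F used in the example is an assumption chosen so the expected outputs are easy to verify.

```python
# Generic epipolar-geometry illustration (not the cited paper's implementation):
# given a fundamental matrix F relating two views, a point x1 in view 1 must lie
# on the epipolar line l' = F x1 in view 2; the residual below scores candidates.
import numpy as np


def epipolar_residual(F: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> float:
    """Distance (in pixels) from point x2 in view 2 to the epipolar line of x1."""
    p1 = np.array([x1[0], x1[1], 1.0])           # homogeneous coordinates
    p2 = np.array([x2[0], x2[1], 1.0])
    line = F @ p1                                # epipolar line [a, b, c] in view 2
    return float(abs(p2 @ line) / np.hypot(line[0], line[1]))


if __name__ == "__main__":
    # Toy fundamental matrix for a pure horizontal translation (rectified stereo):
    # epipolar lines are horizontal, so matching regions must share the same y.
    F = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
    x1 = np.array([100.0, 50.0])
    print(epipolar_residual(F, x1, np.array([140.0, 50.0])))  # 0.0 -> consistent match
    print(epipolar_residual(F, x1, np.array([140.0, 58.0])))  # 8.0 -> off the epipolar line
```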
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.