ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
- URL: http://arxiv.org/abs/2507.11261v2
- Date: Sun, 27 Jul 2025 06:20:54 GMT
- Title: ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition
- Authors: Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He
- Abstract summary: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. We propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods.
- Score: 34.39212457455039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation. Code is available at https://github.com/visualjason/ViewSRD.
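The abstract describes a three-stage pipeline: SRD decomposes a multi-anchor query into single-anchor, perspective-aware statements; Multi-TSI fuses text and scene features per viewpoint through shared Cross-modal Consistent View Tokens; and a reasoning module aggregates the multi-view predictions into one grounding result. The sketch below is a minimal PyTorch illustration of how such a pipeline could be wired together; all class names, tensor shapes, and interfaces are assumptions for exposition and are not taken from the authors' released code (see the linked repository for the actual implementation).

```python
# Minimal, illustrative sketch of the three-stage pipeline described in the
# abstract (SRD -> Multi-TSI with shared view tokens -> multi-view reasoning).
# All class names, shapes, and interfaces are assumptions for exposition; they
# are NOT taken from the released code at github.com/visualjason/ViewSRD.
import torch
import torch.nn as nn


class SimpleRelationDecoupling(nn.Module):
    """SRD, approximated at the embedding level: in the paper this stage rewrites
    a multi-anchor query into single-anchor statements; here we simply split one
    query embedding into S statement embeddings for illustration."""

    def __init__(self, dim: int, num_statements: int = 4):
        super().__init__()
        self.num_statements = num_statements
        self.decompose = nn.Linear(dim, dim * num_statements)

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        b, d = query_emb.shape
        return self.decompose(query_emb).view(b, self.num_statements, d)  # (B, S, D)


class MultiTSI(nn.Module):
    """Multi-view textual-scene interaction: each view conditions the decomposed
    statements on a shared, learnable view token (a stand-in for a CCVT) and lets
    the text attend to the scene's object features."""

    def __init__(self, dim: int, num_views: int = 4, num_heads: int = 8):
        super().__init__()
        self.view_tokens = nn.Parameter(torch.randn(num_views, dim))  # shared tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, statements: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        b = statements.size(0)
        per_view = []
        for v in range(self.view_tokens.size(0)):
            token = self.view_tokens[v].expand(b, 1, -1)      # (B, 1, D) view token
            query = torch.cat([token, statements], dim=1)     # view-conditioned text
            fused, _ = self.attn(query, scene, scene)         # text attends to scene objects
            per_view.append(fused.mean(dim=1))                # (B, D) summary per view
        return torch.stack(per_view, dim=1)                   # (B, V, D)


class TextualSceneReasoning(nn.Module):
    """Aggregates the per-view features and scores every candidate object."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, multi_view: torch.Tensor, scene: torch.Tensor) -> torch.Tensor:
        pooled = multi_view.mean(dim=1)                               # fuse views -> (B, D)
        return torch.einsum("bd,bnd->bn", self.proj(pooled), scene)  # (B, N) object logits


if __name__ == "__main__":
    B, N, D = 2, 16, 256                      # batch, candidate objects, feature dim
    query = torch.randn(B, D)                 # encoded referring expression
    scene = torch.randn(B, N, D)              # encoded 3D object proposals
    srd, tsi, head = SimpleRelationDecoupling(D), MultiTSI(D), TextualSceneReasoning(D)
    logits = head(tsi(srd(query), scene), scene)
    print(logits.shape)                       # torch.Size([2, 16]); argmax = grounded object
```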
Related papers
- HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model [14.277165215664425]
Large vision-language models (VLMs) have shown significant promise for 3D scene understanding.
Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space.
We propose a novel hierarchical multimodal representation for 3D scene reasoning.
arXiv Detail & Related papers (2025-11-28T08:06:20Z)
- MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation [16.539993197236125]
We present MetaFind, a scene-aware tri-modal compositional retrieval framework.
It is designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories.
arXiv Detail & Related papers (2025-10-05T06:37:26Z)
- SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions [9.41365281895669]
SceneForge is a framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions.
By augmenting contrastive training with structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets.
arXiv Detail & Related papers (2025-09-19T07:13:45Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.
This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering [106.96097136553105]
3D Question Answering (3D QA) requires the model to understand the situated 3D scene described by the text, then reason about its surrounding environment and answer a question under that situation.
Existing methods usually rely on global scene perception from pure 3D point clouds and overlook the importance of rich local texture details from multi-view images.
We propose a Dual-vision Scene Perception Network (DSPNet) to comprehensively integrate multi-view and point cloud features to improve robustness in 3D QA.
arXiv Detail & Related papers (2025-03-05T05:13:53Z)
- CrossOver: 3D Scene Cross-Modal Alignment [78.3057713547313]
CrossOver is a novel framework for cross-modal 3D scene understanding.
It learns a unified, modality-agnostic embedding space for scenes by aligning modalities.
It supports robust scene retrieval and object localization, even with missing modalities.
arXiv Detail & Related papers (2025-02-20T20:05:30Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion [0.0]
We propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships.
We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels.
We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results.
arXiv Detail & Related papers (2023-07-16T04:08:03Z)
- MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely the multi-view MRD network (MMRDN).
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of von Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
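The MMRDN summary above mentions fitting view-projected embeddings with von Mises-Fisher (vMF) distributions. As a generic, standalone illustration (not that paper's code), the snippet below estimates a vMF mean direction and concentration from unit-normalized embeddings using the standard closed-form approximation of Banerjee et al. (2005); the function name and toy data are purely illustrative.

```python
# Generic sketch: fitting a von Mises-Fisher (vMF) distribution to unit-norm
# embeddings. This is NOT the MMRDN implementation; the concentration estimate
# is the standard closed-form approximation of Banerjee et al. (2005).
import numpy as np


def fit_vmf(embeddings: np.ndarray) -> tuple[np.ndarray, float]:
    """Estimate the vMF mean direction mu and concentration kappa from (N, D) vectors."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # project to sphere
    s = x.sum(axis=0)
    r_bar = np.linalg.norm(s) / x.shape[0]               # mean resultant length in [0, 1)
    mu = s / np.linalg.norm(s)                           # mean direction
    d = x.shape[1]
    kappa = r_bar * (d - r_bar**2) / (1.0 - r_bar**2)    # approximate concentration
    return mu, float(kappa)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "embeddings" clustered around a random direction in a 64-dim space.
    center = rng.normal(size=64)
    samples = center + 0.1 * rng.normal(size=(256, 64))
    mu, kappa = fit_vmf(samples)
    print(mu.shape, round(kappa, 1))  # (64,) and a large kappa for this tight cluster
```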
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
- TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding [15.617150859765024]
We exploit the Transformer architecture for its natural suitability to permutation-invariant 3D point cloud data.
We propose a TransRefer3D network to extract entity-and-relation aware multimodal context.
Our proposed model significantly outperforms existing approaches by up to 10.6%.
arXiv Detail & Related papers (2021-08-05T05:47:12Z)
- Descriptor-Free Multi-View Region Matching for Instance-Wise 3D Reconstruction [34.21773285521006]
We propose a multi-view region matching method based on epipolar geometry.
We show that the epipolar region matching can be easily integrated into instance segmentation and is effective for instance-wise 3D reconstruction.
arXiv Detail & Related papers (2020-11-27T10:45:18Z)
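The last entry describes multi-view region matching based on epipolar geometry. As a toy, self-contained illustration (not the cited paper's implementation), the snippet below scores a candidate match in a second view by its distance to the epipolar line induced by a known fundamental matrix; the rectified-stereo F used in the example is an assumption chosen so the expected outputs are easy to verify.

```python
# Generic epipolar-geometry illustration (not the cited paper's implementation):
# given a fundamental matrix F relating two views, a point x1 in view 1 must lie
# on the epipolar line l' = F x1 in view 2; the residual below scores candidates.
import numpy as np


def epipolar_residual(F: np.ndarray, x1: np.ndarray, x2: np.ndarray) -> float:
    """Distance (in pixels) from point x2 in view 2 to the epipolar line of x1."""
    p1 = np.array([x1[0], x1[1], 1.0])           # homogeneous coordinates
    p2 = np.array([x2[0], x2[1], 1.0])
    line = F @ p1                                # epipolar line [a, b, c] in view 2
    return float(abs(p2 @ line) / np.hypot(line[0], line[1]))


if __name__ == "__main__":
    # Toy fundamental matrix for a pure horizontal translation (rectified stereo):
    # epipolar lines are horizontal, so matching regions must share the same y.
    F = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
    x1 = np.array([100.0, 50.0])
    print(epipolar_residual(F, x1, np.array([140.0, 50.0])))  # 0.0 -> consistent match
    print(epipolar_residual(F, x1, np.array([140.0, 58.0])))  # 8.0 -> off the epipolar line
```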
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.