Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding
- URL: http://arxiv.org/abs/2508.19165v1
- Date: Tue, 26 Aug 2025 16:13:18 GMT
- Title: Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding
- Authors: Yuzhen Li, Min Liu, Yuan Bian, Xueping Wang, Zhaoyang Li, Gen Li, Yaonan Wang
- Abstract summary: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. We propose to enhance the model's 3D perception on text embeddings and geometry features with two simple and effective methods.
- Score: 46.331376542148696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in meters to the equivalent value in decimeters or centimeters leads to severe performance degradation, even though the physical length remains the same. This observation reveals the weak 3D comprehension of the pre-trained language model, which generates misleading text features that hinder 3D perception. We therefore propose two simple and effective methods to enhance the model's 3D perception on text embeddings and geometry features. First, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which improves the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Second, we propose a Text-Guided Geometry Enhancement (TGE) module that further enriches the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention over geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the "Far" scenario. Our code will be made publicly available.
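To make the two components concrete, here is a minimal sketch of the 3DTE idea as described in the abstract: randomly rewriting distance descriptors into equivalent units so the text encoder sees the same physical length under many numeric forms. The regex, the unit table, and the function name are illustrative assumptions, not the authors' implementation.

```python
import random
import re

# Hypothetical unit table; the paper's exact descriptor set is not given
# in the abstract, so these conversion factors are illustrative.
UNIT_FACTORS = {"meters": 1.0, "decimeters": 10.0, "centimeters": 100.0}

def augment_distance_descriptors(query: str) -> str:
    """Rewrite 'X meters' spans into a randomly chosen equivalent unit, so
    numerically different but physically identical descriptions appear in
    training queries."""
    def convert(match: re.Match) -> str:
        value = float(match.group(1))
        unit = random.choice(list(UNIT_FACTORS))
        return f"{value * UNIT_FACTORS[unit]:g} {unit}"
    return re.sub(r"(\d+(?:\.\d+)?)\s*meters?", convert, query)

print(augment_distance_descriptors("the black car about 12.5 meters away"))
# e.g. "the black car about 1250 centimeters away"
```

The TGE module is described as projecting basic text features into a geometrically consistent space and using them to guide attention over geometry features. One plausible reading is a linear projection followed by cross-attention with a residual connection; the layer sizes, head count, and residual form below are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class TextGuidedGeometryEnhancement(nn.Module):
    """Sketch of the TGE idea: geometry tokens attend to projected
    ("3D-enhanced") text features. Dimensions are placeholders."""
    def __init__(self, text_dim: int = 768, geo_dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, geo_dim)  # into a geometry-consistent space
        self.attn = nn.MultiheadAttention(geo_dim, num_heads=8, batch_first=True)

    def forward(self, text_feats: torch.Tensor, geo_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, L_text, text_dim); geo_feats: (B, L_geo, geo_dim)
        enhanced_text = self.text_proj(text_feats)
        guided, _ = self.attn(query=geo_feats, key=enhanced_text, value=enhanced_text)
        return geo_feats + guided  # residual enhancement of geometry features
```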
Related papers
- Textured Geometry Evaluation: Perceptual 3D Textured Shape Metric via 3D Latent-Geometry Network [47.25343775786203]
Existing metrics, such as Chamfer Distance, often fail to align with how humans evaluate the fidelity of 3D shapes. Recent learning-based metrics attempt to improve this by relying on rendered images and 2D image quality metrics. We propose a new fidelity evaluation method that is based directly on 3D meshes with texture, without relying on rendering.
arXiv Detail & Related papers (2025-12-01T07:53:03Z) - IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction [82.53307702809606]
Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions. We propose the Instance-Grounded Geometry Transformer (IGGT) to unify the knowledge for both spatial reconstruction and instance-level contextual understanding.
arXiv Detail & Related papers (2025-10-26T14:57:44Z) - 3DGeoDet: General-purpose Geometry-aware Image-based 3D Object Detection [17.502554516157893]
3DGeoDet is a novel geometry-aware 3D object detection approach. It effectively handles single- and multi-view RGB images in indoor and outdoor environments.
arXiv Detail & Related papers (2025-06-11T09:18:36Z) - Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces [52.237827968294766]
We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces, the quality of alignment becomes significantly higher. Ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces.
arXiv Detail & Related papers (2025-03-07T09:51:56Z) - SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation [21.070154402838906]
Cross-modal 3D retrieval aims to achieve mutual matching between text descriptions and 3D shapes. The scarcity and expense of 3D data constrain the performance of existing cross-modal 3D retrieval methods. We introduce SCA3D, a novel online data augmentation method for paired 3D shapes and captions in cross-modal 3D retrieval.
arXiv Detail & Related papers (2025-02-26T13:36:40Z) - Mono3DVG: 3D Visual Grounding in Monocular Images [12.191320182791483]
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information.
We build a large-scale dataset, Mono3DRefer, which contains 3D object targets with corresponding geometric text descriptions.
We propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings.
arXiv Detail & Related papers (2023-12-13T09:49:59Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z) - Directional Texture Editing for 3D Models [51.31499400557996]
ITEM3D is designed for automatic 3D object editing according to text instructions.
Leveraging diffusion models and differentiable rendering, ITEM3D takes rendered images as the bridge between text and the 3D representation.
arXiv Detail & Related papers (2023-09-26T12:01:13Z) - MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)