Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
- URL: http://arxiv.org/abs/2504.01476v1
- Date: Wed, 02 Apr 2025 08:29:42 GMT
- Title: Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
- Authors: Junlong Ren, Hao Wang
- Abstract summary: Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. We propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment. Our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks.
- Score: 4.820576346277399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. Current methods predominantly rely on a single 3D representation (e.g., point clouds), and few exploit the consistency and complementarity between 2D and 3D representations, which constrains their performance. To bridge this gap, we propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment (i.e., image, point, text) for enhanced cross-modal 3D retrieval. Notably, we introduce tri-modal reconstruction to improve the generalization ability of the encoders: given point features, we reconstruct image features under the guidance of text features, and vice versa. With well-aligned point cloud and multi-view image features, we aggregate them into multimodal embeddings through fine-grained 2D-3D fusion to enhance geometric and semantic understanding. Recognizing the significant noise in current datasets, where many 3D shapes and texts share similar semantics, we employ hard negative contrastive training that assigns greater weight to harder negatives, yielding robust and discriminative embeddings. Extensive experiments on the Text2Shape dataset demonstrate that our method outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval by a substantial margin.
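The abstract's two central training signals, tri-modal reconstruction and hard negative contrastive training, can be illustrated with a minimal PyTorch-style sketch. The module and function names, feature dimensions, cross-attention layout, and the exact hard-negative weighting below are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of (1) cross-modal feature reconstruction guided by a third
# modality and (2) a hard-negative-weighted bidirectional contrastive loss.
# All names, shapes, and the weighting scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalReconstructor(nn.Module):
    """Reconstructs one modality's features from another, guided by text.

    Example direction: queries come from text token features, keys/values
    from point patch features, and the output is trained to match the
    image features (the symmetric direction swaps point and image roles).
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, guide_tokens, source_tokens):
        # guide_tokens:  (B, Lt, D) text features used as guidance (queries)
        # source_tokens: (B, Ls, D) source-modality features (keys/values)
        recon, _ = self.attn(guide_tokens, source_tokens, source_tokens)
        return self.proj(recon)  # (B, Lt, D) reconstructed target-modality features


def reconstruction_loss(reconstructed, target):
    # Cosine-similarity reconstruction objective on pooled features
    # (an assumed choice; an L2 loss would also fit the description).
    reconstructed = F.normalize(reconstructed.mean(dim=1), dim=-1)
    target = F.normalize(target.mean(dim=1), dim=-1)
    return (1.0 - (reconstructed * target).sum(dim=-1)).mean()


def hard_negative_nce(shape_emb, text_emb, tau: float = 0.07, beta: float = 0.5):
    """Bidirectional InfoNCE that up-weights harder (more similar) negatives.

    `beta` controls how sharply hard negatives are emphasized; beta = 0
    recovers a standard symmetric contrastive loss.
    """
    shape_emb = F.normalize(shape_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = shape_emb @ text_emb.t() / tau                 # (B, B) similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)

    with torch.no_grad():
        w = torch.exp(beta * logits)    # harder negatives -> larger weights
        w.fill_diagonal_(1.0)           # keep the positive pair unweighted
    weighted = logits + torch.log(w)    # weight negatives inside the softmax

    return 0.5 * (F.cross_entropy(weighted, targets) +
                  F.cross_entropy(weighted.t(), targets))
```

In this reading, the reconstruction loss ties the three encoders together so that each modality's features are predictable from the others, while the weighted contrastive term sharpens the decision boundary around semantically similar negatives; the paper's exact formulation may differ.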
Related papers
- SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation [21.070154402838906]
Cross-modal 3D retrieval aims to achieve mutual matching between text descriptions and 3D shapes. The scarcity and high cost of 3D data constrain the performance of existing cross-modal 3D retrieval methods. We introduce SCA3D, a novel online 3D shape and caption data augmentation method for cross-modal 3D retrieval.
arXiv Detail & Related papers (2025-02-26T13:36:40Z) - Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality and cross-relations between different modalities, producing more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z) - COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval [21.070154402838906]
We propose COM3D, making the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance.
Notably, we augment the 3D features through a scene representation transformer, to generate cross-view correspondence features of 3D shapes.
Furthermore, we propose to optimize the cross-modal matching process based on the semi-hard negative example mining method.
arXiv Detail & Related papers (2024-05-07T08:16:13Z) - Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z) - TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z) - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z) - Wonder3D: Single Image to 3D using Cross-Domain Diffusion [105.16622018766236]
Wonder3D is a novel method for efficiently generating high-fidelity textured meshes from single-view images.
To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model.
arXiv Detail & Related papers (2023-10-23T15:02:23Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z) - IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach uses image-to-image pipelines, empowered by LDMs, to generate posed high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
arXiv Detail & Related papers (2023-08-22T14:39:17Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z) - TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval [15.692019545368844]
Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data.
Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at tasks such as retrieval and classification.
We propose a trimodal learning scheme over text, multi-view images, and 3D shape voxels, and show that with large-batch contrastive learning we achieve good performance on text-to-shape retrieval without complex attention mechanisms or losses (a minimal sketch of such a tri-modal contrastive objective follows this list).
arXiv Detail & Related papers (2022-01-19T00:15:15Z)
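As a companion to the TriCoLo entry above, the following is a minimal sketch of a tri-modal contrastive objective: symmetric InfoNCE applied to each pair of (text, image, shape) embeddings and summed. The function names and the temperature value are assumptions; this is not TriCoLo's released code.

```python
# Sketch of a summed pairwise contrastive objective over three modalities.
import torch
import torch.nn.functional as F


def info_nce(a, b, tau: float = 0.07):
    # Symmetric in-batch InfoNCE between two aligned embedding sets (B, D).
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def trimodal_loss(text_emb, image_emb, shape_emb):
    # Align every pair of modalities; with large batches the in-batch
    # negatives are typically informative enough on their own.
    return (info_nce(text_emb, image_emb) +
            info_nce(text_emb, shape_emb) +
            info_nce(image_emb, shape_emb))
```

Summing plain pairwise terms like this is consistent with TriCoLo's claim that large-batch contrastive learning alone, without complex attention mechanisms or losses, suffices for text-to-shape retrieval.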