COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval
- URL: http://arxiv.org/abs/2405.04103v1
- Date: Tue, 7 May 2024 08:16:13 GMT
- Title: COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval
- Authors: Hao Wu, Ruochong LI, Hao Wang, Hui Xiong
- Abstract summary: We propose COM3D, making the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance.
Notably, we augment the 3D features through a scene representation transformer to generate cross-view correspondence features of 3D shapes.
Furthermore, we propose to optimize the cross-modal matching process with a semi-hard negative example mining method.
- Score: 21.070154402838906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent properties of 3D shapes, including depth, spatial hierarchy, and geometric continuity. To address this issue, we propose COM3D, which makes the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance. Notably, we augment the 3D features through a scene representation transformer to generate cross-view correspondence features of 3D shapes, which enrich the inherent features and improve their compatibility with text matching. Furthermore, we optimize the cross-modal matching process with a semi-hard negative example mining method, in an attempt to improve learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of the proposed COM3D, which achieves state-of-the-art results on the Text2Shape dataset.
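The semi-hard mining step described in the abstract can be pictured with a short sketch. The code below is a hypothetical illustration, not the authors' released implementation: it assumes shape and text embeddings already produced by the respective encoders, and the function name, margin value, and fallback rule are illustrative assumptions.

```python
# Hypothetical sketch of semi-hard negative mining for cross-modal
# shape-text matching (not the COM3D release).
import torch
import torch.nn.functional as F

def semi_hard_triplet_loss(shape_emb, text_emb, margin=0.2):
    """Triplet loss with semi-hard negative mining over a batch of matched pairs.

    shape_emb, text_emb: (B, D) tensors; row i of each forms a positive pair.
    A semi-hard negative is less similar to the anchor than its positive,
    but still within the margin, so it yields a non-zero, informative gradient.
    """
    shape_emb = F.normalize(shape_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    sim = shape_emb @ text_emb.t()                  # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                   # (B, 1) positive similarities

    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

    # Semi-hard: below the positive similarity but within the margin band.
    semi_hard = off_diag & (sim < pos) & (sim > pos - margin)

    # Pick the hardest semi-hard negative per anchor; fall back to the
    # hardest negative overall when no semi-hard candidate exists.
    masked = sim.masked_fill(~semi_hard, -1.0)
    fallback = sim.masked_fill(~off_diag, -1.0)
    neg = torch.where(semi_hard.any(dim=1, keepdim=True),
                      masked.max(dim=1, keepdim=True).values,
                      fallback.max(dim=1, keepdim=True).values)

    return F.relu(margin - pos + neg).mean()
```

Semi-hard negatives are generally preferred over the very hardest negatives because they still produce a non-zero loss while avoiding the instability that the hardest (often noisy) negatives can cause, which is consistent with the learning-efficiency motivation stated in the abstract.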
Related papers
- Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction [4.820576346277399]
Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities.
We propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment.
Our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks.
arXiv Detail & Related papers (2025-04-02T08:29:42Z)
- Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces [52.237827968294766]
We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance.
We then extract subspaces of the corresponding feature spaces and find that projecting the learned representations onto well-chosen lower-dimensional subspaces significantly improves alignment quality.
arXiv Detail & Related papers (2025-03-07T09:51:56Z)
- SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation [21.070154402838906]
Cross-modal 3D retrieval aims to achieve mutual matching between text descriptions and 3D shapes.
The scarcity and high cost of 3D data constrain the performance of existing cross-modal 3D retrieval methods.
We introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval.
arXiv Detail & Related papers (2025-02-26T13:36:40Z)
- HOTS3D: Hyper-Spherical Optimal Transport for Semantic Alignment of Text-to-3D Generation [15.34704512558617]
Recent CLIP-guided 3D generation methods have achieved promising results but struggle to generate faithful 3D shapes that conform to the input text.
This paper proposes HOTS3D, which makes the first attempt to effectively bridge this gap by aligning text features to image features with spherical optimal transport (SOT).
With the optimally mapped features, a diffusion-based generator and a NeRF-based decoder are subsequently used to transform them into 3D shapes.
arXiv Detail & Related papers (2024-07-19T15:43:24Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Wonder3D: Single Image to 3D using Cross-Domain Diffusion [105.16622018766236]
Wonder3D is a novel method for efficiently generating high-fidelity textured meshes from single-view images.
To holistically improve the quality, consistency, and efficiency of image-to-3D tasks, we propose a cross-domain diffusion model.
arXiv Detail & Related papers (2023-10-23T15:02:23Z)
- High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views.
Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
- 3D Shape Knowledge Graph for Cross-domain 3D Shape Retrieval [20.880210749809642]
"geometric words" function as elemental constituents for representing entities through combinations.
Each 3D or 2D entity can anchor its geometric terms within the knowledge graph, thereby serving as a link between cross-domain data.
We evaluate the proposed method's performance on the ModelNet40 and ShapeNetCore55 datasets.
arXiv Detail & Related papers (2022-10-27T02:51:24Z)
- TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval [15.692019545368844]
Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data.
Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at tasks such as retrieval and classification.
We propose a trimodal learning scheme over text, multi-view images, and 3D shape voxels, and show that with large-batch contrastive learning we achieve good performance on text-to-shape retrieval without complex attention mechanisms or losses (a minimal sketch of such a trimodal objective appears after this list).
arXiv Detail & Related papers (2022-01-19T00:15:15Z)
- Geometry-Contrastive Transformer for Generalized 3D Pose Transfer [95.56457218144983]
The intuition of this work is to perceive the geometric inconsistency between the given meshes using the self-attention mechanism.
We propose a novel geometry-contrastive Transformer that efficiently perceives global geometric inconsistencies in 3D structure.
We present a latent isometric regularization module together with a novel semi-synthesized dataset for the cross-dataset 3D pose transfer task.
arXiv Detail & Related papers (2021-12-14T13:14:24Z)
- Improving 3D Object Detection with Channel-wise Transformer [58.668922561622466]
We propose a two-stage 3D object detection framework (CT3D) with minimal hand-crafted design.
CT3D simultaneously performs proposal-aware embedding and channel-wise context aggregation.
It achieves an AP of 81.77% in the moderate car category on the KITTI test 3D detection benchmark.
arXiv Detail & Related papers (2021-08-23T02:03:40Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
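As referenced in the TriCoLo entry above, a trimodal batch-contrastive objective can be sketched briefly. The code below is a hypothetical illustration, not TriCoLo's released code: the function names and temperature are assumptions, and each encoder is assumed to map its modality into a shared D-dimensional space.

```python
# Hypothetical sketch: symmetric InfoNCE terms summed over the three
# modality pairs (text, multi-view images, voxels).
import torch
import torch.nn.functional as F

def bimodal_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings (B, D)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_contrastive_loss(text_emb, image_emb, voxel_emb):
    """Sum the pairwise contrastive terms over all three modality pairs."""
    return (bimodal_nce(text_emb, image_emb) +
            bimodal_nce(text_emb, voxel_emb) +
            bimodal_nce(image_emb, voxel_emb))
```

With large batches, each in-batch non-matching sample serves as a negative, which is what allows this style of objective to work without specialized attention mechanisms or hand-crafted losses.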
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.