TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
- URL: http://arxiv.org/abs/2201.07366v2
- Date: Wed, 27 Dec 2023 15:07:03 GMT
- Title: TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
- Authors: Yue Ruan, Han-Hung Lee, Yiming Zhang, Ke Zhang, Angel X. Chang
- Abstract summary: Text-to-shape retrieval is an increasingly relevant problem with the growth of 3D shape data.
Recent work on contrastive losses for learning joint embeddings over multimodal data has been successful at tasks such as retrieval and classification.
We propose a trimodal learning scheme over text, multi-view images and 3D shape voxels, and show that with large batch contrastive learning we achieve good performance on text-to-shape retrieval without complex attention mechanisms or losses.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-shape retrieval is an increasingly relevant problem with the growth
of 3D shape data. Recent work on contrastive losses for learning joint
embeddings over multimodal data has been successful at tasks such as retrieval
and classification. Thus far, work on joint representation learning for 3D
shapes and text has focused on improving embeddings through modeling of complex
attention between representations, or multi-task learning. We propose a
trimodal learning scheme over text, multi-view images and 3D shape voxels, and
show that with large batch contrastive learning we achieve good performance on
text-to-shape retrieval without complex attention mechanisms or losses. Our
experiments serve as a foundation for follow-up work on building trimodal
embeddings for text-image-shape.
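As a rough sketch of the trimodal objective described in the abstract (not the authors' released implementation), the snippet below applies a symmetric InfoNCE-style contrastive loss to each pair of modalities; the function names, encoder outputs, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.
    Matching rows of `a` and `b` are positive pairs; every other
    row in the batch acts as a negative."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def trimodal_loss(text_emb, image_emb, voxel_emb):
    """One natural reading of the trimodal scheme: sum the pairwise
    contrastive losses over all three modality pairs."""
    return (info_nce(text_emb, image_emb)
            + info_nce(text_emb, voxel_emb)
            + info_nce(image_emb, voxel_emb))
```

Because every other item in the batch serves as a negative, the loss signal improves with batch size, which is consistent with the paper's emphasis on large-batch contrastive learning.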
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- COM3D: Leveraging Cross-View Correspondence and Cross-Modal Mining for 3D Retrieval [21.070154402838906]
We propose COM3D, making the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance.
Notably, we augment the 3D features through a scene representation transformer to generate cross-view correspondence features of 3D shapes.
Furthermore, we optimize the cross-modal matching process using semi-hard negative example mining.
arXiv Detail & Related papers (2024-05-07T08:16:13Z)
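The semi-hard negative mining used by COM3D is described only at a high level above; a generic sketch of the idea follows, where the function name, margin value, and fallback rule are assumptions rather than COM3D's actual code.

```python
import torch

def semi_hard_negatives(sim, margin=0.2):
    """Select one semi-hard negative per anchor from a similarity matrix.
    `sim` is (batch, batch) with positives on the diagonal; a semi-hard
    negative is less similar than the positive but within `margin` of it."""
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    pos = sim.diag().unsqueeze(1)                        # (batch, 1)
    candidates = (sim < pos) & (sim > pos - margin) & ~eye
    masked = sim.masked_fill(~candidates, float('-inf'))
    # Fall back to the hardest negative when no semi-hard candidate exists.
    hardest = sim.masked_fill(eye, float('-inf')).argmax(dim=1)
    has_candidate = candidates.any(dim=1)
    return torch.where(has_candidate, masked.argmax(dim=1), hardest)
```

Mining negatives this way keeps gradients informative while avoiding the training collapse that the very hardest negatives can cause.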
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z)
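TAMM's three adapters are only named above, so as a generic illustration of the adapter pattern (a lightweight residual module attached to a frozen encoder; this is a common design, not TAMM's published architecture):

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual MLP that re-projects frozen encoder features
    toward one alignment objective (e.g., image or text alignment)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        # The residual connection preserves the pre-trained representation.
        return x + self.net(x)
```

In a TAMM-like two-stage scheme, several such adapters could specialize a shared 3D feature for different alignment targets while the backbone stays frozen.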
- VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder [56.59814904526965]
This paper introduces a pioneering 3D encoder designed for text-to-3D generation.
A lightweight network is developed to efficiently acquire feature volumes from multi-view images.
A diffusion model with a 3D U-Net is then trained on these feature volumes for text-to-3D generation.
arXiv Detail & Related papers (2023-12-18T18:59:05Z)
- IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach uses image-to-image pipelines, powered by latent diffusion models (LDMs), to generate high-quality posed images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
arXiv Detail & Related papers (2023-08-22T14:39:17Z)
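The real/fake split described in the IT3D summary above maps directly onto a standard GAN discriminator objective; a minimal sketch (illustrative, not IT3D's code):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_real, logits_fake):
    """Binary-cross-entropy GAN discriminator loss. `logits_real` come
    from synthesized multi-view images (treated as real data) and
    `logits_fake` from renderings of the 3D model being optimized."""
    real = F.binary_cross_entropy_with_logits(
        logits_real, torch.ones_like(logits_real))
    fake = F.binary_cross_entropy_with_logits(
        logits_fake, torch.zeros_like(logits_fake))
    return real + fake
```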
- SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation [89.47132156950194]
We present a novel framework built to simplify 3D asset generation for amateur users.
Our method supports a variety of input modalities that can be easily provided by a human.
Our model combines all of these tasks into a single Swiss-army-knife tool.
arXiv Detail & Related papers (2022-12-08T18:59:05Z)
- 3D Shape Knowledge Graph for Cross-domain 3D Shape Retrieval [20.880210749809642]
"geometric words" function as elemental constituents for representing entities through combinations.
Each 3D or 2D entity can anchor its geometric terms within the knowledge graph, thereby serving as a link between cross-domain data.
We evaluate the proposed method's performance on the ModelNet40 and ShapeNetCore55 datasets.
arXiv Detail & Related papers (2022-10-27T02:51:24Z)
- Hard Example Generation by Texture Synthesis for Cross-domain Shape Similarity Learning [97.56893524594703]
Image-based 3D shape retrieval (IBSR) aims to find the corresponding 3D shape of a given 2D image from a large 3D shape database.
Metric learning with adaptation techniques seems a natural solution to shape similarity learning.
We develop a geometry-focused multi-view metric learning framework empowered by texture synthesis.
arXiv Detail & Related papers (2020-10-23T08:52:00Z)
- Info3D: Representation Learning on 3D Objects using Mutual Information Maximization and Contrastive Learning [8.448611728105513]
We propose to extend the InfoMax and contrastive learning principles on 3D shapes.
We show that we can maximize the mutual information between 3D objects and their "chunks" to improve the representations in aligned datasets.
arXiv Detail & Related papers (2020-06-04T00:30:26Z)
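The object-to-chunk mutual information objective in Info3D is commonly approximated with an InfoNCE bound between global and local views; one plausible sketch (names and temperature are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def chunk_infonce(object_emb, chunk_emb, temperature=0.1):
    """InfoNCE lower bound on the mutual information between whole-shape
    embeddings and embeddings of local chunks. Row i of `chunk_emb` is a
    chunk cropped from object i (the positive); other rows are negatives."""
    g = F.normalize(object_emb, dim=-1)
    c = F.normalize(chunk_emb, dim=-1)
    logits = g @ c.t() / temperature
    targets = torch.arange(g.size(0), device=g.device)
    return F.cross_entropy(logits, targets)
```

This is the same contrastive machinery as the trimodal loss sketched earlier, here applied within a single modality between global shapes and their local chunks.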
- Self-Supervised 2D Image to 3D Shape Translation with Disentangled Representations [92.89846887298852]
We present a framework to translate between 2D image views and 3D object shapes.
We propose SIST, a Self-supervised Image to Shape Translation framework.
arXiv Detail & Related papers (2020-03-22T22:44:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.