Zero-Shot Multi-Modal Artist-Controlled Retrieval and Exploration of 3D Object Sets
- URL: http://arxiv.org/abs/2209.00682v1
- Date: Thu, 1 Sep 2022 18:36:43 GMT
- Title: Zero-Shot Multi-Modal Artist-Controlled Retrieval and Exploration of 3D Object Sets
- Authors: Kristofer Schlachter, Benjamin Ahlbrand, Zhu Wang, Valerio Ortenzi, Ken Perlin
- Abstract summary: We perform high-quality 3D asset retrieval from multi-modal inputs, including 2D sketches, images, and text.
We use CLIP as it provides a bridge to higher-level latent features.
We use these features to perform a multi-modality fusion to address the lack of artistic control that affects common data-driven approaches.
- Score: 4.2880616924515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When creating 3D content, highly specialized skills are generally needed to
design and generate models of objects and other assets by hand. We address this
problem through high-quality 3D asset retrieval from multi-modal inputs,
including 2D sketches, images and text. We use CLIP as it provides a bridge to
higher-level latent features. We use these features to perform a multi-modality
fusion to address the lack of artistic control that affects common data-driven
approaches. Our approach allows for multi-modal conditional feature-driven
retrieval through a 3D asset database, by utilizing a combination of input
latent embeddings. We explore the effects of different combinations of feature
embeddings across different input types and weighting methods.
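As a concrete illustration, here is a minimal sketch of the kind of weighted CLIP-embedding fusion the abstract describes. It assumes OpenAI's `clip` package, a database in which each asset is represented by the CLIP embedding of one pre-rendered view, and a simple convex weighting; the function names, weights, and input file are illustrative, not the authors' implementation.

```python
# Minimal sketch: weighted fusion of CLIP embeddings for 3D asset retrieval.
# Assumes each database asset is represented by the unit-norm CLIP embedding
# of one rendered view, computed offline. Names and weights are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_text(prompt: str) -> torch.Tensor:
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        e = model.encode_text(tokens)
    return e / e.norm(dim=-1, keepdim=True)

def embed_image(path: str) -> torch.Tensor:
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        e = model.encode_image(image)
    return e / e.norm(dim=-1, keepdim=True)

def fuse(embeddings, weights):
    # Convex combination of unit-norm embeddings, renormalized so the fused
    # query lives on the same hypersphere as the database entries.
    q = sum(w * e for w, e in zip(weights, embeddings))
    return q / q.norm(dim=-1, keepdim=True)

def retrieve(query: torch.Tensor, asset_db: torch.Tensor, k: int = 5):
    # asset_db: (N, 512) matrix of unit-norm CLIP embeddings of rendered views.
    scores = (asset_db @ query.T).squeeze(-1)  # cosine similarity
    return scores.topk(k).indices.tolist()

# Example: blend a text prompt with a rough 2D sketch, weighting text higher.
query = fuse([embed_text("a wooden rocking chair"),
              embed_image("sketch.png")],  # hypothetical input file
             weights=[0.7, 0.3])
```

Varying the weights trades off which modality dominates the fused query, which is one way to read the paper's exploration of weighting methods.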
Related papers
- Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model [35.184607650708784]
Articulate-Anything automates the articulation of diverse, complex objects from many input modalities, including text, images, and videos.
Our system exploits existing 3D asset datasets via a mesh retrieval mechanism, along with an actor-critic system that iteratively proposes, evaluates, and refines solutions.
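The propose-evaluate-refine loop can be sketched as a simple actor-critic iteration; the `Actor`/`Critic` interfaces, the `Candidate` structure, and the placeholder scoring below are assumptions for illustration, not the paper's system.

```python
# Illustrative propose-evaluate-refine loop in the spirit of an actor-critic
# articulation system. All interfaces here are placeholders (assumptions).
from dataclasses import dataclass, field

@dataclass
class Candidate:
    joints: list = field(default_factory=list)  # proposed articulation (placeholder)

class Actor:
    def propose(self, description: str, feedback: str) -> Candidate:
        # In the real system, a vision-language model would generate an
        # articulation program here, conditioned on critic feedback.
        return Candidate(joints=[("door", "revolute")])

class Critic:
    def evaluate(self, cand: Candidate, description: str) -> tuple[float, str]:
        # In the real system, a vision-language model would score renders of
        # the candidate against the input. Placeholder score and feedback:
        return float(len(cand.joints)), "widen the joint limits"

def articulate(description: str, actor: Actor, critic: Critic, rounds: int = 5):
    best, best_score, feedback = None, float("-inf"), ""
    for _ in range(rounds):
        cand = actor.propose(description, feedback)
        score, feedback = critic.evaluate(cand, description)
        if score > best_score:
            best, best_score = cand, score
    return best
```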
arXiv Detail & Related papers (2024-10-03T19:42:16Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Deep Models for Multi-View 3D Object Recognition: A Review [16.500711021549947]
Multi-view 3D representations for object recognition have thus far demonstrated the most promising results for achieving state-of-the-art performance.
This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks.
arXiv Detail & Related papers (2024-04-23T16:54:31Z)
- ComboVerse: Compositional 3D Assets Creation Using Spatially-Aware Diffusion Guidance [76.7746870349809]
We present ComboVerse, a 3D generation framework that produces high-quality 3D assets with complex compositions by learning to combine multiple models.
Our proposed framework emphasizes spatial alignment of objects, compared with standard score distillation sampling.
arXiv Detail & Related papers (2024-03-19T03:39:43Z)
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection and interpolation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
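The fusion-at-PoIs idea can be pictured with a short sketch: project each 3D point of interest into the image plane, bilinearly interpolate image features at that location, and concatenate them with the LiDAR-branch feature. The pinhole camera model and function signature below are assumptions for illustration, not PoIFusion's code.

```python
import torch
import torch.nn.functional as F

def fuse_at_pois(pois, point_feats, image_feats, K):
    """Illustrative fusion at points of interest (PoIs); an assumption-laden
    sketch, not PoIFusion's implementation.

    pois:        (N, 3) PoIs in camera coordinates (assumed).
    point_feats: (N, C1) per-PoI features from the LiDAR branch.
    image_feats: (1, C2, H, W) feature map from the RGB branch.
    K:           (3, 3) pinhole camera intrinsics.
    """
    # Project PoIs onto the image plane.
    uv = (K @ pois.T).T                          # (N, 3)
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)  # perspective divide

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    _, _, H, W = image_feats.shape
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                # (1, 1, N, 2)

    # Bilinearly interpolate image features at the projected locations.
    sampled = F.grid_sample(image_feats, grid, align_corners=True)
    sampled = sampled.squeeze(0).squeeze(1).T    # (N, C2)

    # Concatenate modalities per PoI; a detection head would consume this.
    return torch.cat([point_feats, sampled], dim=-1)
```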
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval [8.74845857766369]
Multi-modality 3D object retrieval has rarely been developed and analyzed on large-scale datasets.
We propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval.
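A minimal sketch of the self-and-cross attention idea, with point-cloud tokens as queries and one token per rendered view; the layer layout and dimensions are illustrative assumptions, not SCA-PVNet's exact architecture.

```python
import torch
import torch.nn as nn

class SelfCrossAggregator(nn.Module):
    """Illustrative self-and-cross attention aggregation of point-cloud and
    multi-view tokens (an assumed layout, not SCA-PVNet's architecture)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, point_tokens, view_tokens):
        # point_tokens: (B, Np, D) from a point-cloud encoder.
        # view_tokens:  (B, Nv, D), one token per rendered view.
        x, _ = self.self_attn(point_tokens, point_tokens, point_tokens)
        x = self.norm1(point_tokens + x)
        # Cross-attend point tokens (queries) to multi-view tokens.
        y, _ = self.cross_attn(x, view_tokens, view_tokens)
        x = self.norm2(x + y)
        # Pool into a single descriptor for retrieval.
        return x.mean(dim=1)  # (B, D)
```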
arXiv Detail & Related papers (2023-07-20T05:46:32Z)
- UniG3D: A Unified 3D Object Generation Dataset [75.49544172927749]
UniG3D is a unified 3D object generation dataset constructed by employing a universal data transformation pipeline on ShapeNet datasets.
This pipeline converts each raw 3D model into a comprehensive multi-modal data representation.
The selection of data sources for our dataset is based on their scale and quality.
arXiv Detail & Related papers (2023-06-19T07:03:45Z)
- SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation [89.47132156950194]
We present a novel framework built to simplify 3D asset generation for amateur users.
Our method supports a variety of input modalities that can be easily provided by a human.
Our model combines all of these tasks into a single Swiss-army-knife tool.
arXiv Detail & Related papers (2022-12-08T18:59:05Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- Multimodal Semi-Supervised Learning for 3D Objects [19.409295848915388]
This paper explores how the coherence of different modalities of 3D data can be used to improve data efficiency for both 3D classification and retrieval tasks.
We propose a novel multimodal semi-supervised learning framework by introducing instance-level consistency constraint and a novel multimodal contrastive prototype (M2CP) loss.
Our proposed framework outperforms all state-of-the-art counterparts by a large margin for both classification and retrieval tasks on the ModelNet10 and ModelNet40 datasets.
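The contrastive-prototype idea can be rendered generically as classification against learnable class prototypes shared across modalities; the loss below illustrates that concept under stated assumptions and is not the paper's exact M2CP formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_prototype_loss(embeddings, labels, prototypes, tau=0.1):
    """Generic contrastive-prototype loss (an illustration of the idea,
    not the paper's exact M2CP loss).

    embeddings: (B, D) unit-norm features from any modality branch.
    labels:     (B,) ground-truth class indices.
    prototypes: (K, D) learnable class prototypes shared across modalities.
    """
    protos = F.normalize(prototypes, dim=-1)
    logits = embeddings @ protos.T / tau  # scaled cosine similarities
    # Pull each sample toward its class prototype, push it from the others;
    # feeding every modality through the same prototypes aligns them in a
    # shared embedding space.
    return F.cross_entropy(logits, labels)
```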
arXiv Detail & Related papers (2021-10-22T05:33:16Z)
- 3D-MAN: 3D Multi-frame Attention Network for Object Detection [22.291051951077485]
3D-MAN is a 3D multi-frame attention network that effectively aggregates features from multiple perspectives.
We show that 3D-MAN achieves state-of-the-art results compared to published single-frame and multi-frame methods.
arXiv Detail & Related papers (2021-03-30T03:44:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.