OpenShape: Scaling Up 3D Shape Representation Towards Open-World
Understanding
- URL: http://arxiv.org/abs/2305.10764v2
- Date: Fri, 16 Jun 2023 23:31:40 GMT
- Title: OpenShape: Scaling Up 3D Shape Representation Towards Open-World
Understanding
- Authors: Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li,
Shizhong Han, Hong Cai, Fatih Porikli, Hao Su
- Abstract summary: We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds.
We scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions.
We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition.
- Score: 53.21204584976076
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce OpenShape, a method for learning multi-modal joint
representations of text, image, and point clouds. We adopt the commonly used
multi-modal contrastive learning framework for representation alignment, but
with a specific focus on scaling up 3D representations to enable open-world 3D
shape understanding. To achieve this, we scale up training data by ensembling
multiple 3D datasets and propose several strategies to automatically filter and
enrich noisy text descriptions. We also explore and compare strategies for
scaling 3D backbone networks and introduce a novel hard negative mining module
for more efficient training. We evaluate OpenShape on zero-shot 3D
classification benchmarks and demonstrate its superior capabilities for
open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy
of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than
10% for existing methods. OpenShape also achieves an accuracy of 85.3% on
ModelNet40, outperforming previous zero-shot baseline methods by 20% and
performing on par with some fully-supervised methods. Furthermore, we show that
our learned embeddings encode a wide range of visual and semantic concepts
(e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D
and image-3D interactions. Due to their alignment with CLIP embeddings, our
learned shape representations can also be integrated with off-the-shelf
CLIP-based models for various applications, such as point cloud captioning and
point cloud-conditioned image generation.
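For intuition, the core alignment objective described in the abstract can be pictured as a CLIP-style contrastive loss in which a trainable 3D point-cloud encoder is pulled toward frozen CLIP text and image embeddings of the same object. The PyTorch sketch below is illustrative only: the encoder, batch layout, and hyperparameters are placeholders rather than OpenShape's released implementation, and the paper's hard negative mining and text filtering/enrichment steps are omitted.

```python
# Minimal sketch of tri-modal contrastive alignment: a trainable 3D encoder is
# aligned with frozen CLIP text and image embeddings of the same objects.
# Encoder internals and data loading are assumptions, not OpenShape's code.
import torch
import torch.nn.functional as F


def info_nce(query: torch.Tensor, keys: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / tau                      # (B, B) similarity logits
    targets = torch.arange(len(query), device=query.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def alignment_loss(shape_emb, clip_text_emb, clip_image_emb):
    """Align 3D shape embeddings with frozen CLIP text and image embeddings."""
    return info_nce(shape_emb, clip_text_emb) + info_nce(shape_emb, clip_image_emb)


# Hypothetical training step: `point_encoder` is any 3D backbone mapping
# (B, N, 3) point clouds to CLIP-dimensional embeddings; the CLIP embeddings
# are assumed to be precomputed and kept frozen (detached).
def train_step(point_encoder, optimizer, points, clip_text_emb, clip_image_emb):
    shape_emb = point_encoder(points)
    loss = alignment_loss(shape_emb, clip_text_emb.detach(), clip_image_emb.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```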
Related papers
- OpenDlign: Open-World Point Cloud Understanding with Depth-Aligned Images [17.344430840048094]
We present OpenDlign, a novel open-world 3D model using depth-aligned images for robust multimodal alignment.
OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite only fine-tuning 6 million parameters.
arXiv Detail & Related papers (2024-04-25T11:53:36Z)
- Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels [69.55622471172941]
Large-scale 2D vision-language models such as CLIP can be aligned with a 3D encoder to learn generalizable (open-vocabulary) 3D vision models.
We propose an optimization framework, Cross-MoST (Cross-Modal Self-Training), to improve the label-free classification performance of a zero-shot 3D vision model.
arXiv Detail & Related papers (2024-04-15T21:30:50Z)
- TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding [28.112402580426174]
TriAdapter Multi-Modal Learning (TAMM) is a novel two-stage learning approach based on three synergistic adapters.
TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks.
arXiv Detail & Related papers (2024-02-28T17:18:38Z) - MV-CLIP: Multi-View CLIP for Zero-shot 3D Shape Recognition [49.52436478739151]
Large-scale pre-trained models have demonstrated impressive performance in vision and language tasks within open-world scenarios.
Recent methods utilize language-image pre-training to realize zero-shot 3D shape recognition (see the sketch of this protocol after the list).
This paper aims to improve prediction confidence through view selection and hierarchical prompts.
arXiv Detail & Related papers (2023-11-30T09:51:53Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios and build well-aligned, instance-based text-image-point proxies from those complex scenes.
arXiv Detail & Related papers (2023-03-22T09:32:45Z)
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
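A common thread across OpenShape and the related papers above is evaluation by zero-shot 3D classification: a shape embedding that lives in the CLIP space is ranked against CLIP text embeddings of category prompts. The sketch below illustrates that protocol under stated assumptions; `point_encoder`, the prompt template, and the OpenCLIP ViT-bigG-14 / laion2b_s39b_b160k checkpoint are placeholders or assumptions, not any specific paper's released pipeline.

```python
# Minimal sketch of zero-shot 3D classification with CLIP-aligned shape
# embeddings. `point_encoder` is a hypothetical 3D backbone whose outputs are
# assumed to be aligned with the chosen OpenCLIP text space.
import torch
import torch.nn.functional as F
import open_clip


@torch.no_grad()
def zero_shot_classify(point_encoder, points, category_names,
                       clip_name="ViT-bigG-14",
                       pretrained="laion2b_s39b_b160k"):
    """Rank a point cloud of shape (1, N, 3) against free-form category names."""
    clip_model, _, _ = open_clip.create_model_and_transforms(
        clip_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(clip_name)

    # Embed one text prompt per candidate category.
    tokens = tokenizer([f"a 3D model of a {name}" for name in category_names])
    text_emb = F.normalize(clip_model.encode_text(tokens), dim=-1)

    # Embed the shape and score by cosine similarity; 100 is a typical CLIP
    # logit scale and only sharpens the softmax without changing the ranking.
    shape_emb = F.normalize(point_encoder(points), dim=-1)
    probs = (100.0 * shape_emb @ text_emb.t()).softmax(dim=-1)
    return dict(zip(category_names, probs.squeeze(0).tolist()))
```

The predicted class is simply the highest-scoring prompt; prompt templates and temperature can be varied without changing the basic procedure.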
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.