Uni3D: Exploring Unified 3D Representation at Scale
- URL: http://arxiv.org/abs/2310.06773v1
- Date: Tue, 10 Oct 2023 16:49:21 GMT
- Title: Uni3D: Exploring Unified 3D Representation at Scale
- Authors: Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang,
Xinlong Wang
- Abstract summary: We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D-initialized ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
- Score: 66.26710717073372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling up representations for images or text has been extensively
investigated in the past few years and has led to revolutions in learning
vision and language. However, scalable representations for 3D objects and scenes
remain relatively unexplored. In this work, we present Uni3D, a 3D foundation model
to explore the unified 3D representation at scale. Uni3D uses a 2D-initialized
ViT, pretrained end-to-end, to align 3D point cloud features with
image-text aligned features. Via this simple architecture and pretext task,
Uni3D can leverage abundant 2D pretrained models as initialization and
image-text aligned models as the target, unlocking the great potential of 2D
models and scaling-up strategies to the 3D world. We efficiently scale up Uni3D
to one billion parameters, and set new records on a broad range of 3D tasks,
such as zero-shot classification, few-shot classification, open-world
understanding and part segmentation. We show that the strong Uni3D
representation also enables applications such as 3D painting and retrieval in
the wild. We believe that Uni3D provides a new direction for exploring both the
scaling and the efficiency of representations in the 3D domain.
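
To make the pretext task above concrete, here is a minimal PyTorch sketch, not the official implementation: a transformer encoder over point-cloud tokens (the role played by the 2D-initialized ViT in Uni3D) is trained so that its object embedding matches frozen CLIP image and text embeddings of the same object under a symmetric contrastive loss. The class names, the naive patch grouping, and the 512-dimensional target space are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of Uni3D-style pretraining: align a point-cloud transformer's output
# embedding with frozen CLIP image/text embeddings via a contrastive loss.
# PointTokenizer / PointCloudViT and the naive grouping below are hypothetical
# stand-ins, not the paper's actual modules.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PointTokenizer(nn.Module):
    """Groups a point cloud into fixed-size patches and embeds each patch so a
    standard (2D-style) ViT can consume them as a token sequence."""

    def __init__(self, num_groups=64, group_size=32, embed_dim=768):
        super().__init__()
        self.num_groups, self.group_size = num_groups, group_size
        self.mlp = nn.Sequential(
            nn.Linear(3 * group_size, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, points):  # points: (B, N, 3), N >= num_groups * group_size
        B = points.shape[0]
        # Naive grouping by reshaping; real pipelines typically use FPS + kNN.
        patches = points[:, : self.num_groups * self.group_size]
        patches = patches.reshape(B, self.num_groups, self.group_size * 3)
        return self.mlp(patches)  # (B, num_groups, embed_dim)


class PointCloudViT(nn.Module):
    """Plain transformer encoder over point tokens; in the paper this backbone
    is initialized from a pretrained 2D ViT."""

    def __init__(self, embed_dim=768, depth=12, heads=12, out_dim=512):
        super().__init__()
        self.tokenizer = PointTokenizer(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.proj = nn.Linear(embed_dim, out_dim)  # project into the CLIP embedding space

    def forward(self, points):
        tokens = self.tokenizer(points)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, tokens], dim=1))
        return F.normalize(self.proj(x[:, 0]), dim=-1)  # (B, out_dim)


def clip_style_loss(pc_emb, target_emb, temperature=0.07):
    """Symmetric InfoNCE between point-cloud embeddings and frozen CLIP targets."""
    logits = pc_emb @ target_emb.t() / temperature
    labels = torch.arange(pc_emb.size(0), device=pc_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    model = PointCloudViT(depth=2)      # shallow for this toy example
    points = torch.randn(4, 2048, 3)    # a batch of 4 point clouds
    # Stand-ins for frozen CLIP image/text embeddings of the same 4 objects.
    img_emb = F.normalize(torch.randn(4, 512), dim=-1)
    txt_emb = F.normalize(torch.randn(4, 512), dim=-1)
    pc_emb = model(points)
    loss = clip_style_loss(pc_emb, img_emb) + clip_style_loss(pc_emb, txt_emb)
    loss.backward()
    print(f"toy alignment loss: {loss.item():.3f}")
```

Once the point-cloud encoder is aligned this way, zero-shot classification reduces to comparing a point cloud's embedding against CLIP text embeddings of candidate category names, which is what makes the zero-shot and open-world results described above possible.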
Related papers
- Learning 3D Representations from Procedural 3D Programs [6.915871213703219]
Self-supervised learning has emerged as a promising approach for acquiring transferable 3D representations from unlabeled 3D point clouds.
We propose learning 3D representations from procedural 3D programs that automatically generate 3D shapes using simple primitives and augmentations.
arXiv Detail & Related papers (2024-11-25T18:59:57Z) - ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned "in-the-wild" 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and vision-language models (VLMs) have been shown to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios, and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer? [111.11502241431286]
Vision Transformers (ViTs) have proven to be effective in solving 2D image understanding tasks.
ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable.
This paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture.
arXiv Detail & Related papers (2022-09-15T03:34:58Z)