Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
- URL: http://arxiv.org/abs/2209.07026v2
- Date: Sun, 18 Sep 2022 00:48:27 GMT
- Title: Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
- Authors: Yi Wang and Zhiwen Fan and Tianlong Chen and Hehe Fan and Zhangyang Wang
- Abstract summary: Vision Transformers (ViTs) have proven to be effective in solving 2D image understanding tasks.
ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable.
This paper demonstrates the appealing promise to understand the 3D visual world, using a standard 2D ViT architecture.
- Score: 111.11502241431286
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision Transformers (ViTs) have proven effective in solving 2D image
understanding tasks by training over large-scale image datasets; meanwhile, as
a somewhat separate track, they have also been used to model the 3D visual world, such as voxels
or point clouds. However, with the growing hope that transformers can become
the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks
have so far adopted vastly different architecture designs that are hardly
transferable. That invites an (over-)ambitious question: can we close the gap
between the 2D and 3D ViT architectures? As a piloting study, this paper
demonstrates the appealing promise to understand the 3D visual world, using a
standard 2D ViT architecture, with only minimal customization at the input and
output levels without redesigning the pipeline. To build a 3D ViT from its 2D
sibling, we "inflate" the patch embedding and token sequence, accompanied with
new positional encoding mechanisms designed to match the 3D data geometry. The
resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly
robustly on popular 3D tasks such as object classification, point cloud
segmentation and indoor scene detection, compared to highly customized
3D-specific designs. It can hence act as a strong baseline for new 3D ViTs.
Moreover, we note that pursuing a unified 2D-3D ViT design has practical
relevance beyond scientific curiosity. Specifically, we demonstrate that
Simple3D-Former can naturally exploit the wealth of pre-trained weights from
large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to
enhance 3D task performance "for free".
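A minimal sketch of this inflation idea is shown below, assuming a torchvision ViT-B/16 backbone and single-channel voxel grids; the helper name, shapes, and scaling heuristic are illustrative assumptions, not the paper's released implementation.

```python
# Sketch: "inflate" a 2D ViT patch embedding into a 3D voxel patch embedding,
# reusing ImageNet-pretrained weights. Assumes torchvision >= 0.13.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


def inflate_patch_embedding(conv2d: nn.Conv2d, patch_depth: int) -> nn.Conv3d:
    """Turn a 2D patch-projection conv into a 3D one by repeating its kernel
    along a new depth axis and rescaling, so activations keep a similar
    magnitude (the same trick used to inflate 2D CNNs to 3D)."""
    conv3d = nn.Conv3d(
        in_channels=1,                                  # assumed single-channel voxel grid
        out_channels=conv2d.out_channels,
        kernel_size=(patch_depth, *conv2d.kernel_size),
        stride=(patch_depth, *conv2d.stride),
    )
    with torch.no_grad():
        # Average the RGB input channels, then repeat along the depth axis.
        w2d = conv2d.weight.mean(dim=1, keepdim=True)               # (D_out, 1, P, P)
        w3d = w2d.unsqueeze(2).repeat(1, 1, patch_depth, 1, 1) / patch_depth
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d


# Reuse ImageNet-pretrained weights: inflate only the patch embedding and
# keep the transformer encoder blocks untouched.
vit2d = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
patch_embed_3d = inflate_patch_embedding(vit2d.conv_proj, patch_depth=8)

voxels = torch.randn(2, 1, 32, 224, 224)                 # (batch, channel, D, H, W) dummy input
tokens = patch_embed_3d(voxels).flatten(2).transpose(1, 2)  # (batch, num_tokens, dim)
print(tokens.shape)                                       # torch.Size([2, 784, 768])
```

Because the transformer encoder itself is left unchanged, the ImageNet-pretrained blocks can be reused directly; only the patch embedding and the positional encodings (not shown here) need 3D-specific treatment, which is the "for free" transfer the abstract refers to.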
Related papers
- Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D [95.14469865815768]
2D vision models can be used for semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets.
However, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task.
In this paper, we propose Lift3D, which is trained to predict unseen views of the feature spaces produced by a few visual models.
It even outperforms state-of-the-art methods specialized for the task in question.
arXiv Detail & Related papers (2024-03-27T18:13:16Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and vision-language models (VLMs) have been shown to excel at many tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- M3D-VTON: A Monocular-to-3D Virtual Try-On Network [62.77413639627565]
Existing 3D virtual try-on methods mainly rely on annotated 3D human shapes and garment templates.
We propose a novel Monocular-to-3D Virtual Try-On Network (M3D-VTON) that builds on the merits of both 2D and 3D approaches.
arXiv Detail & Related papers (2021-08-11T10:05:17Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)