Scalable 3D Captioning with Pretrained Models
- URL: http://arxiv.org/abs/2306.07279v2
- Date: Fri, 16 Jun 2023 03:58:15 GMT
- Title: Scalable 3D Captioning with Pretrained Models
- Authors: Tiange Luo, Chris Rockwell, Honglak Lee, Justin Johnson
- Abstract summary: Cap3D is an automatic approach for generating descriptive text for 3D objects.
We apply Cap3D to the recently introduced large-scale 3D dataset, Objaverse.
Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human descriptions in terms of quality, cost, and speed.
- Score: 63.16604472745202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Cap3D, an automatic approach for generating descriptive text for
3D objects. This approach utilizes pretrained models from image captioning,
image-text alignment, and an LLM to consolidate captions from multiple views of a
3D asset, completely side-stepping the time-consuming and costly process of
manual annotation. We apply Cap3D to the recently introduced large-scale 3D
dataset, Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted
using 41k human annotations from the same dataset, demonstrates that Cap3D
surpasses human-authored descriptions in terms of quality, cost, and speed.
Through effective prompt engineering, Cap3D rivals human performance in
generating geometric descriptions on 17k collected annotations from the ABO
dataset. Finally, we finetune Text-to-3D models on Cap3D and human captions,
and show that Cap3D outperforms; we also benchmark SOTA methods, including Point-E,
Shap-E, and DreamFusion.
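For intuition, here is a minimal Python sketch of the multi-view consolidation idea described in the abstract. The `caption_fn`, `align_fn`, and `consolidate_fn` callables are hypothetical stand-ins for the pretrained image captioner, image-text alignment model, and LLM; this is not the released Cap3D code.

```python
from typing import Callable, List

from PIL import Image


def cap3d_style_caption(
    views: List[Image.Image],                          # rendered 2D views of one 3D asset
    caption_fn: Callable[[Image.Image], List[str]],    # candidate captions per view (e.g. a BLIP-style model)
    align_fn: Callable[[Image.Image, str], float],     # image-text alignment score (e.g. a CLIP-style model)
    consolidate_fn: Callable[[List[str]], str],        # LLM call that merges per-view captions
) -> str:
    """Consolidate captions from multiple rendered views into one object description."""
    best_per_view: List[str] = []
    for view in views:
        candidates = caption_fn(view)
        # keep the candidate caption that best matches this particular view
        best = max(candidates, key=lambda c: align_fn(view, c))
        best_per_view.append(best)
    # the LLM merges the per-view captions into a single object-level description,
    # e.g. via a prompt that asks it to summarize captions of different views of the same object
    return consolidate_fn(best_per_view)
```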
Related papers
- View Selection for 3D Captioning via Diffusion Ranking [54.78058803763221]
The Cap3D method renders 3D objects into 2D views, which are captioned using pre-trained models.
Some rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations.
We present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views.
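A hedged sketch of the view-selection idea, assuming a generic `alignment_score` callable in place of DiffuRank's text-to-3D-based scorer; the `top_k` default is illustrative only.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

View = TypeVar("View")


def select_views_for_captioning(
    views: Sequence[View],                      # rendered 2D views of one 3D object
    alignment_score: Callable[[View], float],   # assumed stand-in for a 3D-object/view alignment scorer
    top_k: int = 6,
) -> List[View]:
    """Keep only the rendered views that best agree with the underlying 3D object."""
    ranked: List[Tuple[float, int]] = sorted(
        ((alignment_score(v), i) for i, v in enumerate(views)), reverse=True
    )
    # atypical or misleading views fall to the bottom of the ranking and are dropped
    return [views[i] for _, i in ranked[:top_k]]
```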
arXiv Detail & Related papers (2024-04-11T17:58:11Z)
- Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling [9.440800948514449]
We propose a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling.
Our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images.
We design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes.
arXiv Detail & Related papers (2024-04-03T07:30:09Z)
- TPA3D: Triplane Attention for Fast Text-to-3D Generation [28.33270078863519]
We propose Triplane Attention for text-guided 3D generation (TPA3D).
TPA3D is an end-to-end trainable GAN-based deep learning model for fast text-to-3D generation.
We show that TPA3D generates high-quality 3D textured shapes aligned with fine-grained descriptions.
arXiv Detail & Related papers (2023-12-05T10:39:37Z)
- 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling [91.99172731031206]
Current text-to-4D methods face a three-way tradeoff between quality of scene appearance, 3D structure, and motion.
We introduce hybrid score distillation sampling, an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models.
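As a rough illustration only: a single alternating update under the assumption of interchangeable `sds_grad_fns` callables, one per pretrained diffusion model; the actual 4D-fy schedule and blending differ.

```python
from typing import Callable, Sequence

import torch


def hybrid_sds_step(
    scene_params: torch.Tensor,                       # parameters of the (4D) scene representation
    optimizer: torch.optim.Optimizer,                 # optimizer built over scene_params
    step: int,
    sds_grad_fns: Sequence[Callable[[torch.Tensor], torch.Tensor]],
) -> None:
    """One update that alternates supervision between several pretrained diffusion models."""
    grad_fn = sds_grad_fns[step % len(sds_grad_fns)]  # cycle through the supervision sources
    optimizer.zero_grad()
    scene_params.grad = grad_fn(scene_params)         # score-distillation gradient, not a true loss gradient
    optimizer.step()
```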
arXiv Detail & Related papers (2023-11-29T18:58:05Z)
- Control3D: Towards Controllable Text-to-3D Generation [107.81136630589263]
We present Control3D, a text-to-3D generation framework conditioned on an additional hand-drawn sketch.
A 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF.
We exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of the image rendered from the synthetic 3D scene.
arXiv Detail & Related papers (2023-11-09T15:50:32Z)
- Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
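A minimal sketch of object-level language-3D contrastive training in this spirit, assuming placeholder encoders and a simple mean-pool fusion rather than the MixCon3D architecture.

```python
import torch
import torch.nn.functional as F


def language_3d_contrastive_loss(
    view_feats: torch.Tensor,   # (B, V, D) features of multi-view rendered images
    pc_feats: torch.Tensor,     # (B, D)    features of the point clouds
    text_feats: torch.Tensor,   # (B, D)    features of the paired captions
    temperature: float = 0.07,
) -> torch.Tensor:
    """Symmetric InfoNCE between a fused 3D-object embedding and text embeddings."""
    obj = F.normalize(view_feats.mean(dim=1) + pc_feats, dim=-1)  # simple fusion: mean-pooled views + point cloud
    txt = F.normalize(text_feats, dim=-1)
    logits = obj @ txt.t() / temperature                          # (B, B) similarity matrix
    targets = torch.arange(obj.size(0), device=obj.device)        # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```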
arXiv Detail & Related papers (2023-11-03T06:05:36Z)
- 3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation [107.46972849241168]
The 3D-TOGO model generates 3D objects as neural radiance fields with good texture.
Experiments on the largest 3D object dataset (i.e., ABO) verify that 3D-TOGO generates higher-quality 3D objects.
arXiv Detail & Related papers (2022-12-02T11:31:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.