T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation
- URL: http://arxiv.org/abs/2310.02977v2
- Date: Wed, 17 Apr 2024 09:09:17 GMT
- Title: T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation
- Authors: Yuze He, Yushi Bai, Matthieu Lin, Wang Zhao, Yubin Hu, Jenny Sheng, Ran Yi, Juanzi Li, Yong-Jin Liu
- Abstract summary: Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
- Score: 52.029698642883226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, making it hard to answer quantitatively: how far has current progress in text-to-3D come? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark, containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both subjective quality and text alignment, we propose two automatic metrics based on multi-view images rendered from the generated 3D content. The quality metric combines multi-view text-image scores with a regional convolution to assess quality and detect view inconsistency. The alignment metric uses multi-view captioning and GPT-4 evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgment, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results reveal performance differences among ten prevalent text-to-3D methods. Our analysis further highlights the common struggles of current methods in generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.
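The abstract describes the quality metric only at a high level: per-view text-image scores combined via a regional convolution that surfaces view inconsistency. Below is a minimal Python sketch of that idea; the function name quality_score, the per-view scoring inputs, the circular smoothing window, and the max-over-regions aggregation are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def quality_score(view_scores: np.ndarray, window: int = 3) -> float:
    """Hypothetical sketch of a T^3Bench-style quality metric.

    view_scores holds one text-image score (e.g., a CLIP similarity)
    per view rendered around the 3D object; the paper's exact scoring
    model and aggregation may differ.
    """
    assert window % 2 == 1 and window >= 3, "odd window of at least 3 expected"
    half = window // 2
    # Circular padding: neighboring viewpoints wrap around the object.
    padded = np.concatenate([view_scores[-half:], view_scores, view_scores[:half]])
    kernel = np.ones(window) / window
    # Regional convolution: average scores over adjacent viewpoints so
    # a single degenerate view drags down its whole region.
    regional = np.convolve(padded, kernel, mode="valid")  # length == len(view_scores)
    # A coherent object should score well over at least one contiguous
    # band of views; take the best region as the quality score.
    return float(regional.max())

# Example: 8 views around an object, one badly inconsistent viewpoint.
scores = np.array([0.31, 0.30, 0.29, 0.05, 0.28, 0.32, 0.30, 0.31])
print(round(quality_score(scores), 3))

Max-over-smoothed-regions is just one plausible aggregation; the point of smoothing before aggregating is that an isolated bad view, a typical symptom of view inconsistency, cannot hide behind a high average.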
Related papers
- SeMv-3D: Towards Semantic and Multi-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors [115.66850201977887]
We propose SeMv-3D, a novel framework for general text-to-3D generation.
We propose a Triplane Prior Learner that learns triplane priors with 3D spatial features to maintain consistency among different views at the 3D level.
We also design a Semantic-aligned View Synthesizer that preserves the alignment between 3D spatial features and textual semantics in latent space.
arXiv Detail & Related papers (2024-10-10T07:02:06Z) - ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models [65.22994156658918]
We present a method that learns to generate multi-view images in a single denoising process from real-world data.
We design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint.
arXiv Detail & Related papers (2024-03-04T07:57:05Z) - ControlDreamer: Blending Geometry and Style in Text-to-3D [34.92628800597151]
We introduce multi-view ControlNet, a novel depth-aware multi-view diffusion model trained on datasets generated from a carefully curated text corpus.
Our multi-view ControlNet is then integrated into our two-stage pipeline, ControlDreamer, enabling text-guided generation of stylized 3D models.
arXiv Detail & Related papers (2023-12-02T13:04:54Z) - Control3D: Towards Controllable Text-to-3D Generation [107.81136630589263]
We present Control3D, a text-to-3D generation approach conditioned on an additional hand-drawn sketch.
A 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF.
We exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of images rendered from the synthetic 3D scene.
arXiv Detail & Related papers (2023-11-09T15:50:32Z) - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images together with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
arXiv Detail & Related papers (2023-11-03T06:05:36Z) - Guide3D: Create 3D Avatars from Text and Image Guidance [55.71306021041785]
Guide3D is a text-and-image-guided generative model for 3D avatar generation based on diffusion models.
Our framework produces topologically and structurally correct geometry and high-resolution textures.
arXiv Detail & Related papers (2023-08-18T17:55:47Z)