GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark
- URL: http://arxiv.org/abs/2412.09997v1
- Date: Fri, 13 Dec 2024 09:32:08 GMT
- Title: GT23D-Bench: A Comprehensive General Text-to-3D Generation Benchmark
- Authors: Sitong Su, Xiao Cai, Lianli Gao, Pengpeng Zeng, Qinhong Du, Mengqi Li, Heng Tao Shen, Jingkuan Song
- Abstract summary: GT23D-Bench is the first comprehensive benchmark for General Text-to-3D (GT23D).
Our dataset annotates each 3D object with 64-view depth maps, normal maps, rendered images, and coarse-to-fine captions.
Our metrics are divided into a) Textual-3D Alignment, which measures textual alignment with multi-granularity visual 3D representations; and b) 3D Visual Quality, which considers texture fidelity, multi-view consistency, and geometry correctness.
- Score: 111.81516104467039
- Abstract: Recent advances in General Text-to-3D (GT23D) have been significant. However, the lack of a benchmark has hindered systematic evaluation and progress, owing to issues in datasets and metrics: 1) The largest 3D dataset, Objaverse, suffers from omitted annotations, disorganization, and low quality. 2) Existing metrics only evaluate text-image alignment without considering 3D-level quality. To this end, we are the first to present a comprehensive benchmark for GT23D, called GT23D-Bench, consisting of: 1) a 400k high-fidelity and well-organized 3D dataset that addresses the issues in Objaverse through a systematic annotate-organize-filter pipeline; and 2) comprehensive 3D-aware evaluation metrics, comprising 10 clearly defined metrics that thoroughly account for multiple dimensions of GT23D. Notably, GT23D-Bench features three properties: 1) Multimodal Annotations. Our dataset annotates each 3D object with 64-view depth maps, normal maps, rendered images, and coarse-to-fine captions. 2) Holistic Evaluation Dimensions. Our metrics are divided into a) Textual-3D Alignment, which measures textual alignment with multi-granularity visual 3D representations; and b) 3D Visual Quality, which considers texture fidelity, multi-view consistency, and geometry correctness. 3) Valuable Insights. We delve into the performance of current GT23D baselines across different evaluation dimensions and provide insightful analysis. Extensive experiments demonstrate that our annotations and metrics are aligned with human preferences.
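To make the two evaluation dimensions named in the abstract more concrete, here is a minimal sketch of how a text-alignment score and a multi-view consistency score could be computed from per-view embeddings. It is not the metric definition used by GT23D-Bench, which the abstract does not specify. It assumes that a caption embedding and one embedding per rendered view have already been produced by some joint text-image encoder; the function names `cosine_similarity`, `text_alignment_score`, and `view_consistency_score` are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def text_alignment_score(text_emb: Vector, view_embs: Sequence[Vector]) -> float:
    """Mean cosine similarity between the caption and every rendered view.

    Assumption: text and view embeddings live in the same space.
    """
    if not view_embs:
        raise ValueError("at least one view embedding is required")
    return sum(cosine_similarity(text_emb, v) for v in view_embs) / len(view_embs)


def view_consistency_score(view_embs: Sequence[Vector]) -> float:
    """Mean cosine similarity over all unordered pairs of views.

    Assumption: similar embeddings across views indicate a consistent object.
    """
    pairs = list(combinations(view_embs, 2))
    if not pairs:
        raise ValueError("at least two view embeddings are required")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings standing in for real encoder outputs.
caption = [1.0, 0.0]
views = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(text_alignment_score(caption, views))
print(view_consistency_score(views))
```

Averaging over views is only one possible aggregation; a benchmark could equally weight views by visibility or compare against depth and normal maps, which this sketch does not attempt.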
Related papers
- Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation [26.0726219629689]
Text-to-3D generation has achieved remarkable progress in recent years, yet evaluating these methods remains challenging.
Existing benchmarks lack fine-grained evaluation on different prompt categories and evaluation dimensions.
We first propose a comprehensive benchmark named MATE-3D.
The benchmark contains eight well-designed prompt categories that cover single and multiple object generation, resulting in 1,280 generated textured meshes.
arXiv Detail & Related papers (2024-12-15T12:41:44Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds MMScan, the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions [27.469346807311574]
Text-guided 3D visual grounding (T-3DVG) aims to locate a specific object that semantically corresponds to a language query from a complicated 3D scene.
Compared to 2D visual grounding, this task presents great potential and challenges due to its closer proximity to the real world and the complexity of data collection and 3D point cloud source processing.
arXiv Detail & Related papers (2024-06-09T13:52:12Z)
- Mono3DVG: 3D Visual Grounding in Monocular Images [12.191320182791483]
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information.
We build a large-scale dataset, Mono3DRefer, which contains 3D object targets with corresponding geometric text descriptions.
We propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings.
arXiv Detail & Related papers (2023-12-13T09:49:59Z)
- Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance [72.6809373191638]
We propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels.
Specifically, we design a feature-level constraint to align LiDAR and image features based on object-aware regions.
Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations.
Third, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
arXiv Detail & Related papers (2023-12-12T18:57:25Z)
- T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z)
- Homography Loss for Monocular 3D Object Detection [54.04870007473932]
A differentiable loss function, termed Homography Loss, is proposed; it exploits both 2D and 3D information.
Our method outperforms other state-of-the-art methods by a large margin on the KITTI 3D datasets.
arXiv Detail & Related papers (2022-04-02T03:48:03Z)
- From 2D to 3D: Re-thinking Benchmarking of Monocular Depth Prediction [80.67873933010783]
We argue that monocular depth prediction (MDP) is currently witnessing benchmark over-fitting and relying on metrics that are only partially helpful to gauge the usefulness of the predictions for 3D applications.
This limits the design and development of novel methods that are truly aware of - and improving towards estimating - the 3D structure of the scene rather than optimizing 2D-based distances.
We propose a set of metrics well suited to evaluate the 3D geometry of MDP approaches and a novel indoor benchmark, RIO-D3D, crucial for the proposed evaluation methodology.
arXiv Detail & Related papers (2022-03-15T17:50:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.