SeMv-3D: Towards Concurrency of Semantic and Multi-view Consistency in General Text-to-3D Generation
- URL: http://arxiv.org/abs/2410.07658v2
- Date: Wed, 21 May 2025 07:34:36 GMT
- Title: SeMv-3D: Towards Concurrency of Semantic and Multi-view Consistency in General Text-to-3D Generation
- Authors: Xiao Cai, Pengpeng Zeng, Lianli Gao, Sitong Su, Heng Tao Shen, Jingkuan Song
- Abstract summary: SeMv-3D is a novel framework that jointly enhances semantic alignment and multi-view consistency in GT23D generation. At its core, we introduce Triplane Prior Learning (TPL), which effectively learns triplane priors. We also present Prior-based Semantic Aligning in Triplanes (SAT), which enables consistent any-view synthesis.
- Score: 122.47961178994456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General Text-to-3D (GT23D) generation is crucial for creating diverse 3D content across objects and scenes, yet it faces two key challenges: 1) ensuring semantic consistency between input text and generated 3D models, and 2) maintaining multi-view consistency across different perspectives within 3D. Existing approaches typically address only one of these challenges, often leading to suboptimal results in semantic fidelity and structural coherence. To overcome these limitations, we propose SeMv-3D, a novel framework that jointly enhances semantic alignment and multi-view consistency in GT23D generation. At its core, we introduce Triplane Prior Learning (TPL), which effectively learns triplane priors by capturing spatial correspondences across three orthogonal planes using a dedicated Orthogonal Attention mechanism, thereby ensuring geometric consistency across viewpoints. Additionally, we present Prior-based Semantic Aligning in Triplanes (SAT), which enables consistent any-view synthesis by leveraging attention-based feature alignment to reinforce the correspondence between textual semantics and triplane representations. Extensive experiments demonstrate that our method sets a new state-of-the-art in multi-view consistency, while maintaining competitive performance in semantic consistency compared to methods focused solely on semantic alignment. These results emphasize the remarkable ability of our approach to effectively balance and excel in both dimensions, establishing a new benchmark in the field.
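To make the Orthogonal Attention idea concrete, below is a minimal PyTorch sketch of cross-plane attention over triplane features. It illustrates only the TPL intuition (each plane attending to the two orthogonal planes); the module name `OrthogonalAttention`, the feature shapes, and the residual wiring are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class OrthogonalAttention(nn.Module):
    """Illustrative sketch: each triplane plane attends to the other two,
    propagating spatial correspondences across the orthogonal planes."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, planes):
        # planes: list of 3 tensors (B, C, H, W) for the XY, XZ, YZ planes
        tokens = [p.flatten(2).transpose(1, 2) for p in planes]  # 3 x (B, H*W, C)
        out = []
        for i, q in enumerate(tokens):
            # keys/values come from the two orthogonal planes (assumed design)
            kv = torch.cat([tokens[j] for j in range(3) if j != i], dim=1)
            attended, _ = self.attn(self.norm(q), kv, kv)
            out.append(q + attended)  # residual connection
        B, C, H, W = planes[0].shape
        return [t.transpose(1, 2).reshape(B, C, H, W) for t in out]

# usage: three 32x32 feature planes with 64 channels
planes = [torch.randn(2, 64, 32, 32) for _ in range(3)]
fused = OrthogonalAttention(dim=64)(planes)
```

The SAT stage (text-to-triplane alignment) would add a further cross-attention between text tokens and these plane features; it is omitted here since the abstract gives no further detail.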
Related papers
- Unified Representation Space for 3D Visual Grounding [18.652577474202015]
3D visual grounding aims to identify objects in 3D scenes based on text descriptions. Existing methods rely on separately pre-trained vision and text encoders, resulting in a significant gap between the two modalities. The paper proposes UniSpace-3D, which innovatively introduces a unified representation space for 3DVG.
arXiv Detail & Related papers (2025-06-17T06:53:15Z) - HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment [11.00877062567135]
We propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: the Appearance-Preserving Warp Alignment Module, the Semantic Representation Module, and the Multimodal Prior-Guided Appearance Generation Module. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS.
arXiv Detail & Related papers (2025-05-26T07:55:49Z) - econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG shows state-of-the-art performance on four benchmark datasets compared to existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z) - Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction [4.820576346277399]
Cross-modal 3D retrieval is a critical yet challenging task, aiming to achieve bi-directional retrieval between 3D and text modalities. We propose to adopt multi-view images and point clouds to jointly represent 3D shapes, facilitating tri-modal alignment. Our method significantly outperforms previous state-of-the-art methods in both shape-to-text and text-to-shape retrieval tasks.
arXiv Detail & Related papers (2025-04-02T08:29:42Z) - Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning. We introduce a new hierarchical context alignment paradigm for more accurate SOP (Hi-SOP).
arXiv Detail & Related papers (2024-12-11T09:53:10Z) - Focus on Neighbors and Know the Whole: Towards Consistent Dense Multiview Text-to-Image Generator for 3D Creation [64.07560335451723]
CoSER is a novel consistent dense Multiview Text-to-Image Generator for Text-to-3D.
It achieves both efficiency and quality by meticulously learning neighbor-view coherence.
It aggregates information along motion paths explicitly defined by physical principles to refine details.
arXiv Detail & Related papers (2024-08-23T15:16:01Z) - 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z) - Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment [26.858034573776198]
We propose a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment.
Our 3D-VLA exploits the superior ability of current large-scale vision-language models to align the semantics between texts and 2D images.
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
arXiv Detail & Related papers (2023-12-15T09:08:14Z) - SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z) - TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z) - Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training [51.632418297156605]
We introduce MixCon3D, a method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training.
We develop the 3D object-level representation from complementary perspectives, e.g., by combining multi-view rendered images with the point cloud.
Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment.
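As a rough illustration of this language-3D contrastive step, here is a hedged PyTorch sketch of a symmetric CLIP-style loss in which multi-view image features and point-cloud features are fused into a single object embedding (simple view-averaging and addition, an assumption for illustration) before being contrasted with text; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def language_3d_contrastive_loss(img_emb, pc_emb, text_emb, temperature=0.07):
    """Illustrative CLIP-style loss: fuse multi-view image and point-cloud
    embeddings into one 3D object representation, then contrast with text."""
    # img_emb: (B, V, D) multi-view features; pc_emb, text_emb: (B, D)
    obj_emb = F.normalize(img_emb.mean(dim=1) + pc_emb, dim=-1)  # assumed fusion
    text_emb = F.normalize(text_emb, dim=-1)
    logits = obj_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(obj_emb.size(0), device=obj_emb.device)
    # symmetric InfoNCE over both directions: 3D-to-text and text-to-3D
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```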
arXiv Detail & Related papers (2023-11-03T06:05:36Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize a NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z) - Chasing Consistency in Text-to-3D Generation from a Single Image [35.60887743544786]
We present Consist3D, a three-stage framework chasing semantic-, geometric-, and saturation-consistent text-to-3D generation from a single image.
Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness.
The geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency.
arXiv Detail & Related papers (2023-09-07T09:50:48Z) - ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
arXiv Detail & Related papers (2023-03-29T17:59:10Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
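For the memory-bank interaction described above, here is a minimal, hypothetical PyTorch sketch: per-frame point features cross-attend to a FIFO bank of historical features. The class name, the bank capacity, and the residual fusion are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MemoryBankTracker(nn.Module):
    """Illustrative sketch: current-frame features cross-attend to a
    FIFO memory bank holding features from previous frames."""
    def __init__(self, dim: int, capacity: int = 8, num_heads: int = 4):
        super().__init__()
        self.capacity = capacity
        self.memory = []  # list of (N, dim) feature tensors from past frames
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, curr_feat: torch.Tensor) -> torch.Tensor:
        # curr_feat: (N, dim) features of the current frame only
        q = curr_feat.unsqueeze(0)                           # (1, N, dim)
        if self.memory:
            kv = torch.cat(self.memory, dim=0).unsqueeze(0)  # (1, M, dim)
            fused, _ = self.attn(q, kv, kv)
            out = curr_feat + fused.squeeze(0)               # residual fusion
        else:
            out = curr_feat                                  # first frame: no history yet
        self.memory.append(out.detach())                     # store for later frames
        if len(self.memory) > self.capacity:
            self.memory.pop(0)                               # FIFO eviction
        return out
```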
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.