X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic
Textual Guidance
- URL: http://arxiv.org/abs/2303.15764v2
- Date: Fri, 4 Aug 2023 15:07:54 GMT
- Title: X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic
Textual Guidance
- Authors: Yiwei Ma, Xiaoqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang,
Guannan Jiang, Weilin Zhuang, Rongrong Ji
- Abstract summary: X-Mesh is a text-driven 3D stylization framework that incorporates a novel Text-guided Dynamic Attention Module.
We introduce a new standard text-mesh benchmark, MIT-30, and two automated metrics, which will enable future research to achieve fair and objective comparisons.
- Score: 70.08635216710967
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven 3D stylization is a complex and crucial task in the fields of
computer vision (CV) and computer graphics (CG), aimed at transforming a bare
mesh to fit a target text. Prior methods adopt text-independent multilayer
perceptrons (MLPs) to predict the attributes of the target mesh with the
supervision of a CLIP loss. However, such a text-independent architecture lacks
textual guidance when predicting attributes, leading to unsatisfactory
stylization and slow convergence. To address these limitations, we present
X-Mesh, an innovative text-driven 3D stylization framework that incorporates a
novel Text-guided Dynamic Attention Module (TDAM). The TDAM dynamically
integrates the guidance of the target text by utilizing text-relevant spatial
and channel-wise attentions during vertex feature extraction, resulting in more
accurate attribute prediction and faster convergence speed. Furthermore,
existing works lack standard benchmarks and automated metrics for evaluation,
often relying on subjective and non-reproducible user studies to assess the
quality of stylized 3D assets. To overcome this limitation, we introduce a new
standard text-mesh benchmark, namely MIT-30, and two automated metrics, which
will enable future research to achieve fair and objective comparisons. Our
extensive qualitative and quantitative experiments demonstrate that X-Mesh
outperforms previous state-of-the-art methods.
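The abstract describes TDAM as text-relevant spatial and channel-wise attention applied during vertex feature extraction. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the module name, layer shapes, and the sigmoid gating design are assumptions made for the example.

```python
# Minimal sketch (not the official X-Mesh code) of text-guided dynamic attention:
# per-vertex features are reweighted by a channel-wise gate and a per-vertex
# (spatial) gate, both conditioned on a CLIP text embedding. Dimensions are illustrative.
import torch
import torch.nn as nn


class TextGuidedDynamicAttention(nn.Module):
    def __init__(self, feat_dim: int = 256, text_dim: int = 512):
        super().__init__()
        # Channel attention: the text embedding predicts a gate per feature channel.
        self.channel_gate = nn.Sequential(nn.Linear(text_dim, feat_dim), nn.Sigmoid())
        # Spatial attention: text and gated vertex features jointly predict a
        # per-vertex scalar weight.
        self.spatial_gate = nn.Sequential(nn.Linear(feat_dim + text_dim, 1), nn.Sigmoid())

    def forward(self, vertex_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # vertex_feats: (V, feat_dim) per-vertex features from the stylization MLP/encoder
        # text_emb:     (text_dim,)   CLIP embedding of the target text
        ch = self.channel_gate(text_emb)                         # (feat_dim,) channel gate
        feats = vertex_feats * ch                                # channel-wise reweighting
        txt = text_emb.expand(vertex_feats.shape[0], -1)         # broadcast text to all vertices
        sp = self.spatial_gate(torch.cat([feats, txt], dim=-1))  # (V, 1) per-vertex gate
        return feats * sp                                        # text-conditioned vertex features


if __name__ == "__main__":
    module = TextGuidedDynamicAttention()
    feats = torch.randn(1024, 256)       # stand-in vertex features
    text = torch.randn(512)              # stand-in for a CLIP text embedding
    print(module(feats, text).shape)     # torch.Size([1024, 256])
```

In the pipeline the abstract outlines, the attended vertex features would then feed an attribute predictor (e.g. per-vertex color and displacement), and rendered views of the stylized mesh would be compared against the target text with a CLIP loss.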
Related papers
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z) - T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation [52.029698642883226]
Methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF.
Most studies evaluate their results with subjective case studies and user experiments.
We introduce T$^3$Bench, the first comprehensive text-to-3D benchmark.
arXiv Detail & Related papers (2023-10-04T17:12:18Z) - Looking at words and points with attention: a benchmark for
text-to-shape coherence [17.340484439401894]
The evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark.
We employ large language models to automatically refine descriptions associated with shapes.
To validate our approach, we conduct a user study and compare quantitatively our metric with existing ones.
The refined dataset, the new metric and a set of text-shape pairs validated by the user study comprise a novel, fine-grained benchmark.
arXiv Detail & Related papers (2023-09-14T17:59:48Z) - ATT3D: Amortized Text-to-3D Object Synthesis [78.96673650638365]
We amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately.
Our framework - Amortized text-to-3D (ATT3D) - enables knowledge-sharing between prompts to generalize to unseen setups and smooths between text for novel assets and simple animations.
arXiv Detail & Related papers (2023-06-06T17:59:10Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion
Data and Natural Language [4.86658723641864]
We propose a novel text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural description.
Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions; a sketch of two such losses appears after this list.
arXiv Detail & Related papers (2023-05-25T08:32:41Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss; a rough sketch of this idea also appears after this list.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
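The Text-to-Motion Retrieval entry above mentions two widely adopted metric-learning losses without naming them; in text-to-image/video matching the usual pair is a symmetric InfoNCE (contrastive) loss and a hinge-based triplet loss with hard in-batch negatives. The sketch below implements that pair under this assumption; it is not taken from the paper.

```python
# Sketch of two metric-learning losses commonly used for cross-modal retrieval:
# symmetric InfoNCE and a hinge triplet loss with hardest in-batch negatives.
# Which two losses the retrieval paper actually uses is not stated in the summary.
import torch
import torch.nn.functional as F


def info_nce(text_emb, motion_emb, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of matched (text, motion) pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(text_emb.shape[0], device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def triplet_hard(text_emb, motion_emb, margin: float = 0.2):
    """Hinge triplet loss using the hardest in-batch negative for each anchor."""
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    sim = text_emb @ motion_emb.t()                              # (B, B)
    pos = sim.diag()                                             # matched-pair similarities
    mask = torch.eye(sim.shape[0], dtype=torch.bool, device=sim.device)
    neg_t2m = sim.masked_fill(mask, float('-inf')).max(dim=1).values  # hardest motion negative
    neg_m2t = sim.masked_fill(mask, float('-inf')).max(dim=0).values  # hardest text negative
    return (F.relu(margin + neg_t2m - pos) + F.relu(margin + neg_m2t - pos)).mean()


if __name__ == "__main__":
    t, m = torch.randn(8, 256), torch.randn(8, 256)   # toy text / motion embeddings
    print(info_nce(t, m).item(), triplet_hard(t, m).item())
```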
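The Enhanced Modality Transition entry describes a Modality Transition Module that maps visual features into a semantic space before the language model, trained with a modality loss. The exact loss is not given in the summary; the sketch below assumes a small MLP transition network and a cosine-similarity modality loss against a target sentence embedding, purely as one plausible realization.

```python
# Rough sketch of a modality-transition idea (not the paper's implementation):
# an MLP maps pooled visual features into a sentence-embedding space, and a
# cosine-similarity "modality loss" pulls the result toward the ground-truth
# sentence embedding. Dimensions and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityTransition(nn.Module):
    def __init__(self, visual_dim: int = 2048, semantic_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, semantic_dim),
            nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, visual_dim) pooled image features
        return self.net(visual_feats)                 # (B, semantic_dim)


def modality_loss(pred_semantic, target_sentence_emb):
    # 1 - cosine similarity between transferred features and sentence embeddings
    return (1 - F.cosine_similarity(pred_semantic, target_sentence_emb, dim=-1)).mean()


if __name__ == "__main__":
    mtm = ModalityTransition()
    vis = torch.randn(4, 2048)                        # e.g. CNN-pooled image features
    tgt = torch.randn(4, 512)                         # target sentence embeddings
    loss = modality_loss(mtm(vis), tgt)
    loss.backward()
    print(loss.item())
```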