Text-guided 3D Human Generation from 2D Collections
- URL: http://arxiv.org/abs/2305.14312v2
- Date: Fri, 20 Oct 2023 17:39:15 GMT
- Title: Text-guided 3D Human Generation from 2D Collections
- Authors: Tsu-Jui Fu and Wenhan Xiong and Yixin Nie and Jingyu Liu and Barlas Oğuz and William Yang Wang
- Abstract summary: We introduce Text-guided 3D Human Generation (\texttt{T3H}), where a model generates a 3D human guided by a fashion description.
CCH adopts cross-modal attention to fuse compositional human rendering with the extracted fashion semantics.
We conduct evaluations on DeepFashion and SHHQ with diverse fashion attributes covering the shape, fabric, and color of upper and lower clothing.
- Score: 69.04031635550294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D human modeling has been widely used for engaging interaction in gaming,
film, and animation. The customization of these characters is crucial for
creativity and scalability, which highlights the importance of controllability.
In this work, we introduce Text-guided 3D Human Generation (\texttt{T3H}),
where a model generates a 3D human guided by a fashion description. There are
two goals: 1) the 3D human should be rendered articulately, and 2) its outfit
should be controlled by the given text. To address this \texttt{T3H} task, we
propose Compositional Cross-modal Human (CCH). CCH adopts cross-modal attention
to fuse compositional human rendering with the extracted fashion semantics.
Each human body part perceives the relevant textual guidance and renders it as
its visual patterns. We incorporate the human prior and semantic discrimination to enhance
3D geometry transformation and fine-grained consistency, enabling it to learn
from 2D collections for data efficiency. We conduct evaluations on DeepFashion
and SHHQ with diverse fashion attributes covering the shape, fabric, and color
of upper and lower clothing. Extensive experiments demonstrate that CCH
achieves superior results for \texttt{T3H} with high efficiency.
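As a rough illustration of the cross-modal attention described in the abstract (a minimal sketch, not the authors' implementation; the module name, feature dimensions, body-part count, and use of PyTorch are all assumptions), per-body-part rendering features can attend over token-level embeddings of the fashion description so that each part picks up the clothing semantics relevant to it:

```python
# Hypothetical sketch of per-part cross-modal attention; not the CCH code.
import torch
import torch.nn as nn

class PartTextCrossAttention(nn.Module):
    def __init__(self, part_dim=256, text_dim=512, num_heads=4):
        super().__init__()
        # Body-part features act as queries; fashion-text tokens as keys/values.
        self.attn = nn.MultiheadAttention(
            embed_dim=part_dim, num_heads=num_heads,
            kdim=text_dim, vdim=text_dim, batch_first=True)
        self.norm = nn.LayerNorm(part_dim)

    def forward(self, part_feats, text_tokens):
        # part_feats:  (B, num_parts, part_dim)  features of each body part
        # text_tokens: (B, num_tokens, text_dim) encoded fashion description
        fused, _ = self.attn(query=part_feats, key=text_tokens, value=text_tokens)
        return self.norm(part_feats + fused)  # residual fusion per part

# Usage: 24 body parts (SMPL-style, assumed) attending over a 16-token description.
parts = torch.randn(2, 24, 256)
text = torch.randn(2, 16, 512)
fused = PartTextCrossAttention()(parts, text)  # (2, 24, 256)
```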
Related papers
- FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis [51.193297565630886]
The challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images.
This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets.
We propose leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization.
arXiv Detail & Related papers (2024-10-13T01:25:05Z)
- 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation? [5.408549711581793]
We study the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models.
We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D.
arXiv Detail & Related papers (2024-09-16T15:06:12Z)
- Investigating the impact of 2D gesture representation on co-speech gesture generation [5.408549711581793]
We evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model.
arXiv Detail & Related papers (2024-06-21T12:59:20Z)
- HumanLiff: Layer-wise 3D Human Generation with Diffusion Model [55.891036415316876]
Existing 3D human generative models mainly generate a clothed 3D human as an undetachable 3D model in a single pass.
We propose HumanLiff, the first layer-wise 3D human generative model with a unified diffusion process.
arXiv Detail & Related papers (2023-08-18T17:59:04Z)
- TeCH: Text-guided Reconstruction of Lifelike Clothed Humans [35.68114652041377]
Existing methods often generate overly smooth back-side surfaces with a blurry texture.
Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging descriptive text prompts.
We propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field.
arXiv Detail & Related papers (2023-08-16T17:59:13Z)
- Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
- 3D-Aware Semantic-Guided Generative Model for Human Synthesis [67.86621343494998]
This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis.
Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines.
arXiv Detail & Related papers (2021-12-02T17:10:53Z)
- Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose-related tasks.
We propose a novel Siamese denoising autoencoder to learn a 3D pose representation.
Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)