Related papers: Text-guided 3D Human Generation from 2D Collections

URL: http://arxiv.org/abs/2305.14312v2
Date: Fri, 20 Oct 2023 17:39:15 GMT
Title: Text-guided 3D Human Generation from 2D Collections
Authors: Tsu-Jui Fu and Wenhan Xiong and Yixin Nie and Jingyu Liu and Barlas Oğuz and William Yang Wang
Abstract summary: We introduce Text-guided 3D Human Generation (T3H), where a model is to generate a 3D human, guided by the fashion description. We propose Compositional Cross-modal Human (CCH), which adopts cross-modal attention to fuse compositional human rendering with the extracted fashion semantics. We conduct evaluations on DeepFashion and SHHQ with diverse fashion attributes covering the shape, fabric, and color of upper and lower clothing.
Score: 69.04031635550294
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D human modeling has been widely used for engaging interaction in gaming, film, and animation. The customization of these characters is crucial for creativity and scalability, which highlights the importance of controllability. In this work, we introduce Text-guided 3D Human Generation (T3H), where a model is to generate a 3D human, guided by the fashion description. There are two goals: 1) the 3D human should render articulately, and 2) its outfit is controlled by the given text. To address this T3H task, we propose Compositional Cross-modal Human (CCH). CCH adopts cross-modal attention to fuse compositional human rendering with the extracted fashion semantics. Each human body part perceives relevant textual guidance as its visual patterns. We incorporate the human prior and semantic discrimination to enhance 3D geometry transformation and fine-grained consistency, enabling it to learn from 2D collections for data efficiency. We conduct evaluations on DeepFashion and SHHQ with diverse fashion attributes covering the shape, fabric, and color of upper and lower clothing. Extensive experiments demonstrate that CCH achieves superior results for T3H with high efficiency.

Related papers

DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh [102.84518904896737]
DAGSM is a novel pipeline that generates disentangled human bodies and garments from the given text prompts. We first create the unclothed body, followed by a sequence of individual cloth generation based on the body. Experiments have demonstrated that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.
arXiv Detail & Related papers (2024-11-20T07:00:48Z)
FAMOUS: High-Fidelity Monocular 3D Human Digitization Using View Synthesis [51.193297565630886]
The challenge of accurately inferring texture remains, particularly in obscured areas such as the back of a person in frontal-view images. This limitation in texture prediction largely stems from the scarcity of large-scale and diverse 3D datasets. We propose leveraging extensive 2D fashion datasets to enhance both texture and shape prediction in 3D human digitization.
arXiv Detail & Related papers (2024-10-13T01:25:05Z)
2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation? [5.408549711581793]
We study the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models. We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D.
arXiv Detail & Related papers (2024-09-16T15:06:12Z)
Investigating the impact of 2D gesture representation on co-speech gesture generation [5.408549711581793]
We evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model.
arXiv Detail & Related papers (2024-06-21T12:59:20Z)
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [52.44678180286886]
2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. We propose Sherpa3D, a new text-to-3D framework that achieves high-fidelity, generalizability, and geometric consistency simultaneously.
arXiv Detail & Related papers (2023-12-11T18:59:18Z)
HumanLiff: Layer-wise 3D Human Generation with Diffusion Model [55.891036415316876]
Existing 3D human generative models mainly generate a clothed 3D human as an undetachable 3D model in a single pass. We propose HumanLiff, the first layer-wise 3D human generative model with a unified diffusion process.
arXiv Detail & Related papers (2023-08-18T17:59:04Z)
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans [35.68114652041377]
Existing methods often generate overly smooth back-side surfaces with a blurry texture. Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging descriptive text prompts. We propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field.
arXiv Detail & Related papers (2023-08-16T17:59:13Z)
Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z)
3D-Aware Semantic-Guided Generative Model for Human Synthesis [67.86621343494998]
This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis. Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines.
arXiv Detail & Related papers (2021-12-02T17:10:53Z)
Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement [63.853412753242615]
Learning a good 3D human pose representation is important for human pose related tasks. We propose a novel Siamese denoising autoencoder to learn a 3D pose representation. Our approach achieves state-of-the-art performance on two inherently different tasks.
arXiv Detail & Related papers (2020-07-14T14:25:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.