TransHuman: A Transformer-based Human Representation for Generalizable
Neural Human Rendering
- URL: http://arxiv.org/abs/2307.12291v1
- Date: Sun, 23 Jul 2023 10:59:51 GMT
- Title: TransHuman: A Transformer-based Human Representation for Generalizable
Neural Human Rendering
- Authors: Xiao Pan, Zongxin Yang, Jianxin Ma, Chang Zhou, Yi Yang
- Abstract summary: We focus on the task of generalizable neural human rendering, which trains conditional Neural Radiance Fields (NeRF) from multi-view videos of different characters.
Previous methods have primarily used a SparseConvNet (SPC)-based human representation to process the painted SMPL.
We present a brand-new framework named TransHuman, which learns the painted SMPL under the canonical space and captures the global relationships between human parts.
- Score: 52.59454369653773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we focus on the task of generalizable neural human rendering
which trains conditional Neural Radiance Fields (NeRF) from multi-view videos
of different characters. To handle the dynamic human motion, previous methods
have primarily used a SparseConvNet (SPC)-based human representation to process
the painted SMPL. However, such SPC-based representation i) optimizes under the
volatile observation space which leads to the pose-misalignment between
training and inference stages, and ii) lacks the global relationships among
human parts, which are critical for handling the incomplete painted SMPL. Tackling
these issues, we present a brand-new framework named TransHuman, which learns
the painted SMPL under the canonical space and captures the global
relationships between human parts with transformers. Specifically, TransHuman
is mainly composed of Transformer-based Human Encoding (TransHE), Deformable
Partial Radiance Fields (DPaRF), and Fine-grained Detail Integration (FDI).
TransHE first processes the painted SMPL under the canonical space via
transformers for capturing the global relationships between human parts. Then,
DPaRF binds each output token with a deformable radiance field for encoding the
query point under the observation space. Finally, the FDI is employed to
further integrate fine-grained information from reference images. Extensive
experiments on ZJU-MoCap and H36M show that our TransHuman achieves new
state-of-the-art performance by a significant margin while remaining highly efficient. Project
page: https://pansanity666.github.io/TransHuman/
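For readers who want a concrete picture of the pipeline, the following is a minimal PyTorch-style sketch of the three stages described above (TransHE over painted-SMPL part features in canonical space, DPaRF conditioning each token's radiance field on the query point, and FDI fusing pixel-aligned features from the reference images). All class names, tensor shapes, aggregation choices, and hyper-parameters are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of the TransHE -> DPaRF -> FDI pipeline; shapes and modules
# are assumptions for illustration, not the released TransHuman code.
import torch
import torch.nn as nn

class TransHumanSketch(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # TransHE: a transformer over per-part features of the painted SMPL,
        # processed in canonical space to capture global part relationships.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.transhe = nn.TransformerEncoder(layer, num_layers=3)
        # DPaRF: stand-in for the per-token deformable radiance fields; a
        # shared MLP conditioned on a token and the query point expressed in
        # that token's deformed local frame (observation space).
        self.dparf = nn.Sequential(
            nn.Linear(feat_dim + 3, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # FDI: fuse the coarse condition with fine-grained pixel-aligned
        # features sampled from reference images, then predict density + RGB.
        self.fdi = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 4),  # (sigma, r, g, b)
        )

    def forward(self, part_feats, local_coords, pixel_feats):
        # part_feats:   (B, P, C)     painted-SMPL part features, canonical space
        # local_coords: (B, N, P, 3)  N query points in each part's local frame
        # pixel_feats:  (B, N, C)     pixel-aligned features from reference images
        tokens = self.transhe(part_feats)                              # (B, P, C)
        B, N, P, _ = local_coords.shape
        tok = tokens.unsqueeze(1).expand(B, N, P, tokens.size(-1))
        per_part = self.dparf(torch.cat([tok, local_coords], dim=-1))  # (B, N, P, C)
        coarse = per_part.mean(dim=2)                                  # aggregate over parts
        out = self.fdi(torch.cat([coarse, pixel_feats], dim=-1))       # (B, N, 4)
        return out[..., :1], out[..., 1:]                              # density, colour
```

The shared MLP and mean-pooling over parts are simplifications: in the paper, each output token is bound to its own deformable radiance field, and the resulting condition is further integrated with fine-grained reference-image features before volume rendering.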
Related papers
- GenLayNeRF: Generalizable Layered Representations with 3D Model
Alignment for Multi-Human View Synthesis [1.6574413179773757]
GenLayNeRF is a generalizable layered scene representation for free-viewpoint rendering of multiple human subjects.
We divide the scene into multi-human layers anchored by the 3D body meshes.
We extract point-wise image-aligned and human-anchored features which are correlated and fused.
arXiv Detail & Related papers (2023-09-20T20:37:31Z) - ActorsNeRF: Animatable Few-shot Human Rendering with Generalizable NeRFs [61.677180970486546]
We propose a novel animatable NeRF called ActorsNeRF.
It is first pre-trained on diverse human subjects, and then adapted with few-shot monocular video frames for a new actor with unseen poses.
We quantitatively and qualitatively demonstrate that ActorsNeRF significantly outperforms the existing state-of-the-art on few-shot generalization to new people and poses on multiple datasets.
arXiv Detail & Related papers (2023-04-27T17:58:48Z) - Novel View Synthesis of Humans using Differentiable Rendering [50.57718384229912]
We present a new approach for synthesizing novel views of people in new poses.
Our synthesis makes use of diffuse Gaussian primitives that represent the underlying skeletal structure of a human.
Rendering these primitives results in a high-dimensional latent image, which is then transformed into an RGB image by a decoder network.
arXiv Detail & Related papers (2023-03-28T10:48:33Z) - Multimodal Vision Transformers with Forced Attention for Behavior
Analysis [0.0]
We introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and makes use of additional inputs.
FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition.
We achieve state-of-the-art results on the Udiva v0.5, First Impressions v2, and MPII Group Interaction datasets.
arXiv Detail & Related papers (2022-12-07T21:56:50Z) - REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer [96.64111294772141]
Human Video Motion Transfer (HVMT) aims, given an image of a source person, to generate a video of that person imitating the motion of a driving person.
Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform the warping operation.
This paper presents a novel REgion-to-whole human MOtion Transfer (REMOT) framework based on GANs.
arXiv Detail & Related papers (2022-09-01T14:03:51Z) - TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z) - Human View Synthesis using a Single Sparse RGB-D Input [16.764379184593256]
We present a novel view synthesis framework that generates realistic renders from unseen views of a human captured with a single sparse RGB-D sensor.
An enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details.
arXiv Detail & Related papers (2021-12-27T20:13:53Z) - Neural Human Performer: Learning Generalizable Radiance Fields for Human
Performance Rendering [34.80975358673563]
We propose a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture.
Experiments on the ZJU-MoCap and AIST datasets show that our method significantly outperforms recent generalizable NeRF methods on unseen identities and poses.
arXiv Detail & Related papers (2021-09-15T17:32:46Z) - ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos [49.337912335944026]
We formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model with two input modalities: RGB and depth videos.
To benefit from vision Transformer capabilities, we use two vision Transformer models for human detection and visual feature representation.
A temporal representation of the human body is obtained using a vision Transformer and an LSTM network.
arXiv Detail & Related papers (2021-08-23T10:48:18Z) - Transformer Networks for Trajectory Forecasting [11.802437934289062]
We propose the novel use of Transformer Networks for trajectory forecasting.
This is a fundamental switch from the sequential, step-by-step processing of LSTMs to the attention-only memory mechanisms of Transformers (a minimal sketch of this idea follows the list below).
arXiv Detail & Related papers (2020-03-18T09:17:49Z)
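As a rough illustration of the attention-only forecasting idea in the last entry, the sketch below encodes an observed 2D trajectory with a Transformer encoder and regresses the future positions in one shot; sequence lengths, model size, and the prediction head are assumptions made for the example, not the paper's configuration.

```python
# Illustrative attention-only trajectory forecaster (not the paper's model).
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    def __init__(self, obs_len=8, pred_len=12, d_model=64):
        super().__init__()
        self.embed = nn.Linear(2, d_model)                        # (x, y) -> token
        self.pos = nn.Parameter(torch.zeros(obs_len, d_model))    # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, pred_len * 2)              # all future offsets at once
        self.pred_len = pred_len

    def forward(self, obs):                                       # obs: (B, obs_len, 2)
        h = self.encoder(self.embed(obs) + self.pos)              # (B, obs_len, d_model)
        offsets = self.head(h[:, -1]).view(-1, self.pred_len, 2)  # step-wise displacements
        return obs[:, -1:, :] + offsets.cumsum(dim=1)             # future absolute positions

# Example: forecast 12 future positions from 8 observed ones.
model = TrajectoryTransformer()
future = model(torch.randn(4, 8, 2))                              # (4, 12, 2)
```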
This list is automatically generated from the titles and abstracts of the papers on this site.