UIKA: Fast Universal Head Avatar from Pose-Free Images
- URL: http://arxiv.org/abs/2601.07603v2
- Date: Fri, 16 Jan 2026 12:26:41 GMT
- Title: UIKA: Fast Universal Head Avatar from Pose-Free Images
- Authors: Zijian Wu, Boyao Zhou, Liangxiao Hu, Hongyu Liu, Yuan Sun, Xuan Wang, Xun Cao, Yujun Shen, Hao Zhu
- Abstract summary: We present UIKA, a feed-forward animatable Gaussian head model built from an arbitrary number of unposed inputs. Unlike traditional avatar methods, we rethink the task through the lenses of model representation, network design, and data preparation. Our method significantly outperforms existing approaches in both monocular and multi-view settings.
- Score: 65.03770342532134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present UIKA, a feed-forward animatable Gaussian head model built from an arbitrary number of unposed inputs, including a single image, multi-view captures, and smartphone-captured videos. Unlike traditional avatar methods, which require a studio-level multi-view capture system and reconstruct a subject-specific model through a lengthy optimization process, we rethink the task through the lenses of model representation, network design, and data preparation. First, we introduce a UV-guided avatar modeling strategy in which each input image is associated with a pixel-wise facial correspondence estimation. This correspondence allows us to reproject each valid pixel color from screen space to UV space, which is independent of camera pose and character expression. Furthermore, we design learnable UV tokens over which the attention mechanism operates at both the screen and UV levels. The learned UV tokens can be decoded into canonical Gaussian attributes using aggregated UV information from all input views. To train our large avatar model, we additionally prepare a large-scale, identity-rich synthetic training dataset. Our method significantly outperforms existing approaches in both monocular and multi-view settings. See more details on our project page: https://zijian-wu.github.io/uika-page/
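The UV-guided modeling step is the mechanistic core of the abstract above: per-pixel facial correspondences let every valid screen pixel be scattered into a texture indexed by UV coordinates, which is invariant to camera pose and expression. The PyTorch sketch below illustrates that reprojection; the function name, the texel-averaging strategy, and the texture resolution are illustrative assumptions, not the paper's implementation.

```python
import torch

def reproject_to_uv(images, uv_coords, valid_mask, uv_res=256):
    """Scatter screen-space pixel colors into a shared UV texture.

    images:     (V, H, W, 3) input views in [0, 1]
    uv_coords:  (V, H, W, 2) per-pixel facial UV correspondences in [0, 1]
    valid_mask: (V, H, W)    True where a pixel lies on the face
    Returns a (uv_res, uv_res, 3) texture plus a per-texel hit count.
    """
    colors = images[valid_mask]              # (N, 3) valid pixels from all views
    uv = uv_coords[valid_mask]               # (N, 2) their UV targets
    ij = (uv * (uv_res - 1)).round().long()  # quantize UVs to texel indices
    flat = ij[:, 1] * uv_res + ij[:, 0]      # row-major texel index

    tex = torch.zeros(uv_res * uv_res, 3)
    cnt = torch.zeros(uv_res * uv_res)
    # Accumulate colors, then average where several pixels hit one texel.
    tex.index_add_(0, flat, colors)
    cnt.index_add_(0, flat, torch.ones_like(flat, dtype=torch.float))
    tex = tex / cnt.clamp(min=1).unsqueeze(-1)
    return tex.view(uv_res, uv_res, 3), cnt.view(uv_res, uv_res)
```

Because the texture is indexed by facial UVs rather than camera rays, views taken under different poses and expressions write into the same texels, which is what lets the downstream UV tokens aggregate an arbitrary number of unposed inputs.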
Related papers
- FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision [54.69512425050288]
We introduce FlexAvatar, a method for creating high-quality and complete 3D head avatars from a single image. Our training procedure yields a smooth latent avatar space that facilitates identity interpolation and flexible fitting to an arbitrary number of input observations.
arXiv Detail & Related papers (2025-12-17T17:09:52Z)
- Audio-Driven Universal Gaussian Head Avatars [66.56656075831954]
We introduce the first method for audio-driven universal photorealistic avatar synthesis. It combines a person-agnostic speech model with our novel Universal Head Avatar Prior. Our method is the first general audio-driven avatar model that accounts for detailed appearance modeling and rendering.
arXiv Detail & Related papers (2025-09-23T12:46:43Z)
- Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars [20.807609264738865]
We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy.
arXiv Detail & Related papers (2025-07-21T18:20:09Z)
- UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling [71.87807614875497]
We propose UV Gaussians, which models the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures.
We collect and process a new dataset of human motion, which includes multi-view images, scanned models, parametric model registration, and corresponding texture maps. Experimental results demonstrate that our method achieves state-of-the-art novel-view and novel-pose synthesis.
arXiv Detail & Related papers (2024-03-18T09:03:56Z)
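The representation UV Gaussians describes, per-Gaussian attributes stored as channels of a 2D UV texture anchored to the body mesh, can be sketched compactly. The channel layout, activations, and texture resolution below are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

# Hypothetical channel layout for a UV-space Gaussian texture:
# 3 position offsets + 3 scales + 4 rotation (quaternion) + 1 opacity + 3 color = 14.
gaussian_tex = torch.randn(14, 256, 256)  # (C, H_uv, W_uv), e.g. a network's output

def decode_uv_gaussians(tex):
    """Split a UV attribute texture into per-texel 3D Gaussian parameters."""
    c = tex.view(tex.shape[0], -1).T           # (N, 14), one row per texel
    offset  = c[:, 0:3]                        # displacement from the posed mesh surface
    scale   = torch.exp(c[:, 3:6])             # strictly positive scales
    rot     = F.normalize(c[:, 6:10], dim=-1)  # unit quaternion
    opacity = torch.sigmoid(c[:, 10:11])
    color   = torch.sigmoid(c[:, 11:14])
    return offset, scale, rot, opacity, color
```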
- NViST: In the Wild New View Synthesis from a Single Image with Transformers [8.361847255300846]
We propose NViST, a transformer-based model for efficient novel-view synthesis from a single image.
NViST is trained on MVImgNet, a large-scale dataset of casually-captured real-world videos.
We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures.
arXiv Detail & Related papers (2023-12-13T23:41:17Z)
- HQ3DAvatar: High Quality Controllable 3D Head Avatar [65.70885416855782]
This paper presents a novel approach to building highly photorealistic digital head avatars.
Our method learns a canonical space via an implicit function parameterized by a neural network.
At test time, our method is driven by a monocular RGB video.
arXiv Detail & Related papers (2023-03-25T13:56:33Z)
- You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos [10.795522875068073]
You Only Train Once (YOTO) is a dynamic human generation framework that performs free-viewpoint rendering of different human identities with distinct motions.
In this paper, we propose a set of learnable identity codes to expand the capability of the framework for multi-identity free-viewpoint rendering.
YOTO achieves state-of-the-art performance on all evaluation metrics while offering significant benefits in training and inference efficiency as well as rendering quality.
arXiv Detail & Related papers (2023-03-10T10:23:17Z)
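The learnable identity codes YOTO describes amount to an embedding table that conditions a single shared network on which identity is being rendered, so one training run covers all subjects. A minimal sketch, assuming concatenation-based conditioning and hypothetical layer sizes:

```python
import torch
import torch.nn as nn

class MultiIdentityField(nn.Module):
    """One shared MLP conditioned on a learnable per-identity code."""
    def __init__(self, num_identities, code_dim=64, in_dim=3):
        super().__init__()
        self.codes = nn.Embedding(num_identities, code_dim)  # trained jointly with the MLP
        self.mlp = nn.Sequential(
            nn.Linear(in_dim + code_dim, 128), nn.ReLU(),
            nn.Linear(128, 4),  # e.g. density + RGB per query point
        )

    def forward(self, points, identity_id):
        # points: (N, 3) query positions; identity_id: 0-dim long tensor
        code = self.codes(identity_id).expand(points.shape[0], -1)
        return self.mlp(torch.cat([points, code], dim=-1))

# usage: field = MultiIdentityField(num_identities=8)
#        out = field(torch.randn(1024, 3), torch.tensor(3))
```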
- PINA: Learning a Personalized Implicit Neural Avatar from a Single RGB-D Video Sequence [60.46092534331516]
We present a novel method to learn Personalized Implicit Neural Avatars (PINA) from a short RGB-D sequence.
PINA does not require complete scans, nor does it require a prior learned from large datasets of clothed humans.
We propose a method to learn the shape and non-rigid deformations via a pose-conditioned implicit surface and a deformation field.
arXiv Detail & Related papers (2022-03-03T15:04:55Z)
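The pose-conditioned implicit surface and deformation field PINA mentions follow a common pattern: warp an observed point back to a canonical space as a function of pose, then query a canonical SDF there. A schematic sketch, with the residual warp and layer sizes being assumptions rather than PINA's exact architecture:

```python
import torch
import torch.nn as nn

class PoseConditionedAvatar(nn.Module):
    """Canonical SDF queried through a pose-conditioned deformation field."""
    def __init__(self, pose_dim=72):  # e.g. an SMPL-style axis-angle pose vector
        super().__init__()
        # Predict a residual warp from posed space back to canonical space.
        self.deform = nn.Sequential(
            nn.Linear(3 + pose_dim, 128), nn.Softplus(),
            nn.Linear(128, 3),
        )
        # Signed distance to the surface, defined once in canonical space.
        self.sdf = nn.Sequential(
            nn.Linear(3, 128), nn.Softplus(),
            nn.Linear(128, 1),
        )

    def forward(self, x_posed, pose):
        # x_posed: (N, 3) points in posed space; pose: (pose_dim,) shared vector
        p = pose.expand(x_posed.shape[0], -1)
        x_canonical = x_posed + self.deform(torch.cat([x_posed, p], dim=-1))
        return self.sdf(x_canonical)
```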
- Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering [34.80975358673563]
We propose a novel approach that learns generalizable neural radiance fields based on a parametric human body model for robust performance capture.
Experiments on the ZJU-MoCap and AIST datasets show that our method significantly outperforms recent generalizable NeRF methods on unseen identities and poses.
arXiv Detail & Related papers (2021-09-15T17:32:46Z)
- pixelNeRF: Neural Radiance Fields from One or Few Images [20.607712035278315]
pixelNeRF is a learning framework that predicts a continuous neural scene representation conditioned on one or few input images.
We conduct experiments on ShapeNet benchmarks for single image novel view synthesis tasks with held-out objects.
In all cases, pixelNeRF outperforms current state-of-the-art baselines for novel view synthesis and single image 3D reconstruction.
arXiv Detail & Related papers (2020-12-03T18:59:54Z)
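pixelNeRF's key mechanism is conditioning the radiance field on pixel-aligned image features: each 3D query point is projected into the input view, and a CNN feature is bilinearly sampled at that location. A minimal sketch of that sampling step, where the pre-normalized intrinsics, function name, and tensor layout are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(feat_map, points, K_norm, world_to_cam):
    """Sample per-point image features by projecting 3D points into one view.

    feat_map:     (1, C, Hf, Wf) CNN feature map of the input image
    points:       (N, 3) world-space query points, assumed in front of the camera
    K_norm:       (3, 3) intrinsics rescaled so projections land in [-1, 1]
    world_to_cam: (4, 4) rigid transform from world to camera coordinates
    """
    # Homogeneous transform of world points into the camera frame.
    hom = torch.cat([points, torch.ones(points.shape[0], 1)], dim=-1)  # (N, 4)
    cam = (world_to_cam @ hom.T).T[:, :3]                              # (N, 3)
    # Perspective projection to normalized image coordinates.
    uvw = (K_norm @ cam.T).T                                           # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                      # (N, 2)
    # Bilinear lookup; grid_sample expects coordinates in [-1, 1].
    grid = uv.view(1, 1, -1, 2)                                        # (1, 1, N, 2)
    feats = F.grid_sample(feat_map, grid, align_corners=True)          # (1, C, 1, N)
    return feats[0, :, 0].T                                            # (N, C)
```

The sampled per-point features are then concatenated with the (positionally encoded) query point and fed to the NeRF MLP, which is what lets a single network generalize to unseen scenes from one or a few images.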