Interspatial Attention for Efficient 4D Human Video Generation
- URL: http://arxiv.org/abs/2505.15800v2
- Date: Sun, 25 May 2025 17:44:44 GMT
- Title: Interspatial Attention for Efficient 4D Human Video Generation
- Authors: Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein,
- Abstract summary: We introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern video generation models.<n>ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos.<n>Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation.
- Score: 98.36274427702915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)--based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variation autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.
Related papers
- Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models [83.76517697509156]
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input.<n>We propose a novel iterative sliding denoising process to enhance view-temporal consistency of the 4D diffusion model.<n>Our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches.
arXiv Detail & Related papers (2025-07-17T17:59:17Z) - SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency [37.96042037188354]
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.
arXiv Detail & Related papers (2024-07-24T17:59:43Z) - 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [53.89348957053395]
We introduce a novel pipeline designed for text-to-4D scene generation.
Our method begins by generating a reference video using the video generation model.
We then learn the canonical 3D representation of the video using a freeze-time video.
arXiv Detail & Related papers (2024-06-11T17:19:26Z) - Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer [38.85054820740242]
We present a novel approach for generating high-quality, coherent human videos from a single image.
Our framework combines the strengths of diffusion transformers for capturing global correlations and CNNs for accurate condition injection.
We demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos.
arXiv Detail & Related papers (2024-05-27T17:53:29Z) - Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, textbfDiffusion4D, for efficient and scalable 4D content generation.
We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets.
Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z) - TC4D: Trajectory-Conditioned Text-to-4D Generation [94.90700997568158]
We propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components.
We learn local deformations that conform to the global trajectory using supervision from a text-to-video model.
Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion.
arXiv Detail & Related papers (2024-03-26T17:55:11Z) - VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis [40.869862603815875]
VLOGGER is a method for audio-driven human video generation from a single input image.
We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls.
We show applications in video editing and personalization.
arXiv Detail & Related papers (2024-03-13T17:59:02Z) - Generative Rendering: Controllable 4D-Guided Video Generation with 2D
Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on an in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.