Related papers: Interspatial Attention for Efficient 4D Human Video Generation

Interspatial Attention for Efficient 4D Human Video Generation

URL: http://arxiv.org/abs/2505.15800v2
Date: Sun, 25 May 2025 17:44:44 GMT
Title: Interspatial Attention for Efficient 4D Human Video Generation
Authors: Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein,
Abstract summary: We introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern video generation models.<n>ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos.<n>Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation.
Score: 98.36274427702915
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)--based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variation autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at https://dsaurus.github.io/isa4d/.

Related papers

Human Video Generation from a Single Image with 3D Pose and View Control [62.676151243249556]
We present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality multi-view,temporally coherent human videos from a single image.<n>HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii)
arXiv Detail & Related papers (2026-02-24T18:42:20Z)
UniMo: Unifying 2D Video and 3D Human Motion with an Autoregressive Framework [54.337290937468175]
We propose UniMo, an autoregressive model for joint modeling of 2D human videos and 3D human motions within a unified framework.<n>We show that our method simultaneously generates corresponding videos and motions while performing accurate motion capture.
arXiv Detail & Related papers (2025-12-03T16:03:18Z)
View-Consistent Diffusion Representations for 3D-Consistent Video Generation [60.68052293389281]
Current generated videos still contain visual artifacts arising from 3D inconsistencies.<n>We propose ViCoDR, a new approach for improving the 3D consistency of video models by learning multi-view consistent diffusion representations.
arXiv Detail & Related papers (2025-11-24T11:16:55Z)
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models [83.76517697509156]
This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input.<n>We propose a novel iterative sliding denoising process to enhance view-temporal consistency of the 4D diffusion model.<n>Our method is able to synthesize high-quality and consistent novel-view videos and significantly outperforms the existing approaches.
arXiv Detail & Related papers (2025-07-17T17:59:17Z)
SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency [37.96042037188354]
We present Stable Video 4D (SV4D), a latent video diffusion model for multi-frame and multi-view consistent dynamic 3D content generation.
arXiv Detail & Related papers (2024-07-24T17:59:43Z)
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [53.89348957053395]
We introduce a novel pipeline designed for text-to-4D scene generation. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video.
arXiv Detail & Related papers (2024-06-11T17:19:26Z)
Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer [38.85054820740242]
We present a novel approach for generating high-quality, coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations and CNNs for accurate condition injection. We demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos.
arXiv Detail & Related papers (2024-05-27T17:53:29Z)
Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, textbfDiffusion4D, for efficient and scalable 4D content generation. We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets. Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z)
TC4D: Trajectory-Conditioned Text-to-4D Generation [94.90700997568158]
We propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion.
arXiv Detail & Related papers (2024-03-26T17:55:11Z)
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis [40.869862603815875]
VLOGGER is a method for audio-driven human video generation from a single input image. We use a novel diffusion-based architecture that augments text-to-image models with both spatial and temporal controls. We show applications in video editing and personalization.
arXiv Detail & Related papers (2024-03-13T17:59:02Z)
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z)
Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses. Our method outperforms state-of-the-art methods on an in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.