Related papers: PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation

URL: http://arxiv.org/abs/2508.05091v1
Date: Thu, 07 Aug 2025 07:19:02 GMT
Title: PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation
Authors: Jingxuan He, Busheng Su, Finn Wong,
Abstract summary: We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence.<n>Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation.<n>We show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.
Score: 4.417342791754854
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating long, temporally coherent videos with precise control over subject identity and motion is a formidable challenge for current diffusion models, which often suffer from identity drift and are limited to short clips. We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation, while simultaneously conditioning on pose information at the channel level for fine-grained motion control. To overcome duration limits, PoseGen pioneers an interleaved segment generation method that seamlessly stitches video clips together, using a shared KV cache mechanism and a specialized transition process to ensure background consistency and temporal smoothness. Trained on a remarkably small 33-hour video dataset, extensive experiments show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.

Related papers

LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos.<n>At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation [24.86836673853292]
STAGE is an auto-regressive framework that pioneers hierarchical feature coordination and multiphase optimization for sustainable video synthesis.<n>HTFT enhances temporal consistency between video frames throughout the video generation process.<n>We generated 600 frames of high-quality driving videos on the Nuscenes dataset, which far exceeds the maximum length achievable by existing methods.
arXiv Detail & Related papers (2025-06-16T06:53:05Z)
HumanDiT: Pose-Guided Diffusion Transformer for Long-form Human Motion Video Generation [39.69554411714128]
We propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a dataset containing 14,000 hours of high-quality video.<n>HumanDiT supports numerous video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation.<n>Experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
arXiv Detail & Related papers (2025-02-07T11:36:36Z)
Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance. During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z)
VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability. An identity-aware appearance controller integrates additional facial information without compromising other appearance details. A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling [19.004339956475498]
MAVIN is designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence. We introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics. Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions.
arXiv Detail & Related papers (2024-05-28T09:46:09Z)
VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos. To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process. The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.