MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
- URL: http://arxiv.org/abs/2406.19680v1
- Date: Fri, 28 Jun 2024 06:40:53 GMT
- Title: MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
- Authors: Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou
- Abstract summary: We propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length.
Its confidence-aware pose guidance ensures high frame quality and temporal smoothness.
For generating long and smooth videos, we propose a progressive latent fusion strategy.
- Score: 11.267119929093042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. First, we introduce confidence-aware pose guidance, which ensures high frame quality and temporal smoothness. Second, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Finally, for generating long and smooth videos, we propose a progressive latent fusion strategy. In this way, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .
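The abstract names three techniques; the progressive latent fusion strategy lends itself to a compact sketch: denoise overlapping video segments separately, then blend the latents inside each overlap with a linearly ramped weight so segment boundaries stay smooth. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation; the function name `fuse_segments`, the overlap length, and the linear ramp are all illustrative assumptions.

```python
import numpy as np

def fuse_segments(segments, overlap):
    """Progressively fuse latents of consecutive video segments.

    segments: list of arrays shaped (frames, channels, h, w); consecutive
        segments are assumed to share `overlap` frames.
    Returns one latent sequence whose shared frames are a weighted blend,
    ramping linearly from the earlier segment to the later one.
    """
    fused = segments[0]
    # Linear ramp: weight of the incoming segment rises 0 -> 1 across the overlap.
    w = np.linspace(0.0, 1.0, overlap).reshape(-1, 1, 1, 1)
    for seg in segments[1:]:
        tail = fused[-overlap:]   # last frames generated so far
        head = seg[:overlap]      # first frames of the new segment
        blend = (1.0 - w) * tail + w * head
        fused = np.concatenate([fused[:-overlap], blend, seg[overlap:]], axis=0)
    return fused

# Toy usage: three 16-frame latent segments with a 4-frame overlap.
segs = [np.random.randn(16, 4, 8, 8) for _ in range(3)]
video_latents = fuse_segments(segs, overlap=4)
print(video_latents.shape)  # (40, 4, 8, 8): 16 + 12 + 12 frames
```

A confidence-weighted variant of the same ramp could serve the pose-guidance side, scaling each keypoint's guidance and loss contribution by its detection confidence, though that mapping is likewise an assumption here.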
Related papers
- DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation [50.66658181705527]
We present DAWN, a framework that enables all-at-once generation of dynamic-length video sequences.
DAWN consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation.
Our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements.
arXiv Detail & Related papers (2024-10-17T16:32:36Z) - Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos [13.368981834953981]
We propose the Fréchet Video Motion Distance (FVMD) metric, which focuses on evaluating motion consistency in video generation.
Specifically, we design explicit motion features based on keypoint tracking, and then measure the similarity between these features via the Fréchet distance.
We carry out a large-scale human study, demonstrating that our metric effectively detects temporal noise and aligns better with human perceptions of generated video quality than existing metrics.
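The summary names two ingredients: explicit motion features from keypoint tracking, and a Fréchet distance between feature distributions. The sketch below shows only the second step, using the standard Gaussian-fit form of the Fréchet (2-Wasserstein) distance; the toy `real`/`fake` arrays stand in for the paper's tracking-based features, which are not reproduced here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets,
    each shaped (num_videos, feature_dim)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b).real  # drop tiny imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Hypothetical motion features, e.g. statistics of tracked-keypoint velocities.
rng = np.random.default_rng(0)
real = rng.normal(size=(256, 32))
fake = rng.normal(loc=0.5, size=(256, 32))
print(frechet_distance(real, fake))  # larger means less consistent motion
```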
arXiv Detail & Related papers (2024-07-23T02:10:50Z) - Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos.
We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance.
During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
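As a rough illustration of the training recipe described above (not the authors' code), the following sketch samples batches of frame indices with random non-uniform temporal spacing, each batch forced to contain a shared anchor frame; every name and parameter is an assumption.

```python
import numpy as np

def sample_anchored_indices(num_seqs, seq_len, video_len, anchor, rng):
    """Sample `num_seqs` index sets with random non-uniform spacing,
    each guaranteed to contain the shared `anchor` frame."""
    batches = []
    pool = np.setdiff1d(np.arange(video_len), [anchor])
    for _ in range(num_seqs):
        idx = rng.choice(pool, size=seq_len - 1, replace=False)
        batches.append(np.sort(np.append(idx, anchor)))  # keep temporal order
    return np.stack(batches)

rng = np.random.default_rng(0)
# Four 8-frame sequences drawn from a 120-frame video, all anchored to frame 0.
print(sample_anchored_indices(num_seqs=4, seq_len=8, video_len=120, anchor=0, rng=rng))
```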
arXiv Detail & Related papers (2024-07-21T13:14:17Z) - AtomoVideo: High Fidelity Image-to-Video Generation [25.01443995920118]
We propose a high-fidelity framework for image-to-video generation, named AtomoVideo.
Through multi-granularity image injection, the generated video stays more faithful to the given image.
Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation.
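The long-sequence prediction mentioned above is, in outline, autoregressive chunking: condition each new chunk on the last few frames already generated. Below is a minimal sketch of that loop with a stand-in predictor; `predict_chunk`, `ctx`, and the sizes are illustrative assumptions, not AtomoVideo's actual interface.

```python
import numpy as np

def generate_long_video(predict_chunk, first_frame, chunk_len, num_chunks, ctx):
    """Iteratively extend a video: each call to `predict_chunk` receives
    the last `ctx` frames as conditioning and returns `chunk_len` new frames."""
    frames = [np.repeat(first_frame[None], ctx, axis=0)]  # bootstrap the context
    for _ in range(num_chunks):
        context = np.concatenate(frames, axis=0)[-ctx:]
        frames.append(predict_chunk(context, chunk_len))
    return np.concatenate(frames, axis=0)

# Stand-in predictor: drifts the last context frame with a little noise.
def dummy_predictor(context, chunk_len):
    last = context[-1]
    return np.stack([last + 0.01 * np.random.randn(*last.shape)
                     for _ in range(chunk_len)])

video = generate_long_video(dummy_predictor, np.zeros((3, 16, 16)),
                            chunk_len=8, num_chunks=5, ctx=4)
print(video.shape)  # (44, 3, 16, 16): 4 context frames + 5 * 8 generated
```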
arXiv Detail & Related papers (2024-03-04T07:41:50Z) - Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
Replacing it with a more scalable transformer-based architecture allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z) - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame-retention branch to a pre-trained video diffusion model.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z) - Video Generation Beyond a Single Clip [76.5306434379088]
Existing video generation models can only produce clips that are short relative to the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z) - Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
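A hedged sketch of the hierarchical idea, assuming a two-level scheme: sample sparse keyframe latents first, then fill each gap with a second-stage model. Here plain linear interpolation stands in for the conditional infilling network, and every name and size is an assumption, chosen so the toy output exceeds one thousand frames.

```python
import numpy as np

def hierarchical_generate(sample_keyframes, infill, num_key, stride):
    """Two-level generation: sample sparse keyframe latents, then fill
    each gap between neighbors with `stride - 1` intermediate latents."""
    keys = sample_keyframes(num_key)            # (num_key, c, h, w)
    out = [keys[0][None]]
    for a, b in zip(keys[:-1], keys[1:]):
        out.append(infill(a, b, stride - 1))    # intermediate latents
        out.append(b[None])
    return np.concatenate(out, axis=0)

# Stand-ins: random keyframes, linear interpolation as the "infill model".
sample_keyframes = lambda n: np.random.randn(n, 4, 8, 8)
def infill(a, b, n):
    t = np.linspace(0.0, 1.0, n + 2)[1:-1].reshape(-1, 1, 1, 1)
    return (1.0 - t) * a + t * b

latents = hierarchical_generate(sample_keyframes, infill, num_key=9, stride=128)
print(latents.shape)  # (1025, 4, 8, 8): more than one thousand frames
```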
arXiv Detail & Related papers (2022-11-23T18:58:39Z) - A Good Image Generator Is What You Need for High-Resolution Video
Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, disentangling content from motion.
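A minimal sketch of this recipe, assuming nothing about the paper's actual architecture: a fixed image generator renders each frame from a latent code, while a separate motion model proposes latent increments, so content (the initial code) and motion (the increments) are disentangled by construction. All names below are hypothetical.

```python
import numpy as np

def render_video(G, motion_step, z0, num_frames):
    """Render a video by walking a trajectory in the latent space of a
    fixed, pre-trained image generator G. Content is set by z0; motion
    comes entirely from the increments produced by `motion_step`."""
    z, frames = z0, []
    for _ in range(num_frames):
        frames.append(G(z))
        z = z + motion_step(z)  # motion generator proposes the next step
    return np.stack(frames)

# Stand-ins: a toy "generator" and a small random-walk motion model.
G = lambda z: np.tanh(z).reshape(8, 8)           # latent -> fake 8x8 frame
motion_step = lambda z: 0.1 * np.random.randn(*z.shape)

video = render_video(G, motion_step, z0=np.random.randn(64), num_frames=16)
print(video.shape)  # (16, 8, 8)
```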
arXiv Detail & Related papers (2021-04-30T15:38:41Z)