Replace Anyone in Videos
- URL: http://arxiv.org/abs/2409.19911v1
- Date: Mon, 30 Sep 2024 03:27:33 GMT
- Title: Replace Anyone in Videos
- Authors: Xiang Wang, Changxin Gao, Yuehuan Wang, Nong Sang,
- Abstract summary: We propose the ReplaceAnyone framework, which focuses on localizing and manipulating human motion in videos.
Specifically, we formulate this task as an image-conditioned pose-driven video inpainting paradigm.
We introduce diverse mask forms involving regular and irregular shapes to avoid shape leakage and allow granular local control.
- Score: 39.4019337319795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in controllable human-centric video generation, particularly with the rise of diffusion models, have demonstrated considerable progress. However, achieving precise and localized control over human motion, e.g., replacing or inserting individuals into videos while exhibiting desired motion patterns, remains challenging. In this work, we propose the ReplaceAnyone framework, which focuses on localizing and manipulating human motion in videos with diverse and intricate backgrounds. Specifically, we formulate this task as an image-conditioned pose-driven video inpainting paradigm, employing a unified video diffusion architecture that facilitates image-conditioned pose-driven video generation and inpainting within masked video regions. Moreover, we introduce diverse mask forms involving regular and irregular shapes to avoid shape leakage and allow granular local control. Additionally, we implement a two-stage training methodology, initially training an image-conditioned pose-driven video generation model, followed by joint training of video inpainting within masked areas. In this way, our approach enables seamless replacement or insertion of characters while maintaining the desired pose motion and reference appearance within a single framework. Experimental results demonstrate the effectiveness of our method in generating realistic and coherent video content.
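Two ideas in the abstract lend themselves to a short illustration: shape-agnostic masks that avoid leaking the replaced person's silhouette, and inpainting-style blending that keeps the original video outside the mask. The sketch below is a hedged, hypothetical PyTorch illustration, not the authors' released code; the box coordinates, thresholds, and `denoise_step` callable are assumptions.

```python
# Hedged sketch (not the authors' code) of two ideas from the abstract:
# (1) regular vs. irregular masks, so the mask outline does not leak the
#     silhouette of the person being replaced, and
# (2) inpainting-style blending: generated content inside the mask,
#     original video latents outside it.
import torch
import torch.nn.functional as F

def make_regular_mask(h, w, box):
    """Axis-aligned box mask over the person region. box = (y0, x0, y1, x1)."""
    y0, x0, y1, x1 = box
    m = torch.zeros(1, 1, h, w)
    m[..., y0:y1, x0:x1] = 1.0
    return m

def make_irregular_mask(h, w, box, blob_scale=8, dilate=9):
    """Shape-agnostic mask: low-frequency noise thresholded inside the box,
    then dilated, so the exact body contour is not revealed (less shape leakage)."""
    box_mask = make_regular_mask(h, w, box)
    noise = torch.rand(1, 1, max(h // blob_scale, 1), max(w // blob_scale, 1))
    noise = F.interpolate(noise, size=(h, w), mode="bilinear", align_corners=False)
    blob = (noise > 0.35).float() * box_mask            # random blobs restricted to the box
    return F.max_pool2d(blob, dilate, stride=1, padding=dilate // 2)  # dilation

@torch.no_grad()
def inpaint_blend_step(z_t, z_orig_t, mask, denoise_step):
    """One denoising step of masked video inpainting: keep the original
    latents outside the mask, synthesize inside it."""
    z_pred = denoise_step(z_t)                          # e.g. one video-diffusion UNet step
    return mask * z_pred + (1.0 - mask) * z_orig_t
```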
Related papers
- Blended Latent Diffusion under Attention Control for Real-World Video Editing [5.659933808910005]
We propose to adapt an image-level blended latent diffusion model to perform local video editing tasks.
Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of the randomly noised ones.
We also introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps.
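A minimal sketch of deterministic DDIM inversion, which is what makes "background latents" obtainable from the source frames instead of random noise; the `eps_model` callable and the schedule tensor are assumptions, and this is not the paper's implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, timesteps):
    """Run the deterministic DDIM update in reverse: starting from a clean
    latent x0, step through increasing timesteps so the result can later be
    denoised back into a latent that stays faithful to the source frame.
    First-order approximation: the noise prediction at the target step is
    reused for the jump, as in common inversion implementations."""
    x = x0
    a_cur = torch.tensor(1.0, device=x0.device)      # alpha_bar of a clean latent
    for t in timesteps:                              # increasing, e.g. [20, 40, ..., 980]
        a_next = alphas_cumprod[t]
        eps = eps_model(x, t)
        x0_pred = (x - (1.0 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1.0 - a_next).sqrt() * eps
        a_cur = a_next
    return x                                         # background latent for later denoising
```

At editing time, the foreground latents can then be blended with these background latents under the cross-attention-derived mask at every denoising step.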
arXiv Detail & Related papers (2024-09-05T13:23:52Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation [15.569467643817447]
We introduce a technique that concurrently learns both foreground and background dynamics by segregating their movements using distinct motion representations.
We train on real-world videos enhanced with this innovative motion depiction approach.
To further extend video generation to longer sequences without accumulating errors, we adopt a clip-by-clip generation strategy.
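A clip-by-clip strategy of this kind can be sketched as follows, assuming a hypothetical `generate_clip(cond_frames, length)` that returns `length` new frames, optionally conditioned on a few context frames; this is not the paper's code.

```python
import torch

def generate_long_video(generate_clip, total_frames, clip_len=16, overlap=4):
    """Extend the video one clip at a time; each new clip sees the last
    `overlap` frames of the previous one so motion stays continuous and
    errors are not produced for the whole sequence at once."""
    video = generate_clip(cond_frames=None, length=clip_len)   # first clip
    while video.shape[0] < total_frames:
        context = video[-overlap:]                             # carry over recent frames
        video = torch.cat([video, generate_clip(cond_frames=context, length=clip_len)], dim=0)
    return video[:total_frames]
```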
arXiv Detail & Related papers (2024-05-26T00:53:26Z) - DreamVideo: Composing Your Dream Videos with Customized Subject and Motion [52.7394517692186]
We present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject.
DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model.
In motion learning, we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern.
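The motion-learning stage can be pictured with a small residual adapter that is the only trainable part while the pre-trained video diffusion backbone stays frozen; the `adapter=` hook and all names below are assumptions, not DreamVideo's actual interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAdapter(nn.Module):
    """Small residual bottleneck applied to temporal features."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)      # starts as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h):                   # h: (..., dim) features from a frozen block
        return h + self.up(F.relu(self.down(h)))

def motion_learning_step(frozen_eps_model, adapter, optimizer, noisy_latent, t, noise):
    """One denoising-loss step in which only the adapter receives gradients;
    the optimizer should be built over adapter.parameters() only."""
    pred = frozen_eps_model(noisy_latent, t, adapter=adapter)   # backbone frozen (hypothetical hook)
    loss = F.mse_loss(pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```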
arXiv Detail & Related papers (2023-12-07T16:57:26Z) - SAVE: Protagonist Diversification with Structure Agnostic Video Editing [29.693364686494274]
Previous works usually perform well on simple, consistent shapes but easily collapse when the target has a body shape that differs greatly from the original.
We propose motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly.
We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow.
arXiv Detail & Related papers (2023-12-05T05:13:20Z) - Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z) - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [63.43133768897087]
We propose a method to convert open-domain images into animated videos.
The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance.
Our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image.
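One way to picture "incorporating the image into the generative process as guidance" is to feed the reference image both as extra cross-attention context and as channels concatenated to the noisy latent. The sketch below is a hedged illustration with hypothetical names and interfaces, not DynamiCrafter's actual architecture.

```python
import torch
import torch.nn as nn

class ImageContextProjector(nn.Module):
    """Maps a global image embedding to a few context tokens for cross-attention."""
    def __init__(self, img_dim, ctx_dim, num_tokens=16):
        super().__init__()
        self.proj = nn.Linear(img_dim, num_tokens * ctx_dim)
        self.num_tokens, self.ctx_dim = num_tokens, ctx_dim

    def forward(self, img_emb):                           # img_emb: (B, img_dim)
        return self.proj(img_emb).view(-1, self.num_tokens, self.ctx_dim)

def guided_eps(eps_model, z_t, t, text_ctx, img_emb, img_latent, projector):
    """z_t: (B, C, T, H, W) noisy video latent; img_latent: (B, C, 1, H, W)."""
    img_ctx = projector(img_emb)                          # image tokens
    ctx = torch.cat([text_ctx, img_ctx], dim=1)           # text + image context
    img_rep = img_latent.expand(-1, -1, z_t.shape[2], -1, -1)
    z_in = torch.cat([z_t, img_rep], dim=1)               # UNet assumed to accept 2C channels
    return eps_model(z_in, t, context=ctx)
```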
arXiv Detail & Related papers (2023-10-18T14:42:16Z) - MotionDirector: Motion Customization of Text-to-Video Diffusion Models [24.282240656366714]
Motion Customization aims to adapt existing text-to-video diffusion models to generate videos with customized motion.
We propose MotionDirector, with a dual-path LoRA architecture to decouple the learning of appearance and motion.
Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions.
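A minimal LoRA layer makes the dual-path idea concrete: one set of low-rank adapters sits on spatial layers and is trained on single frames (appearance), a separate set sits on temporal layers and is trained on full clips (motion). The wrapper below is a generic sketch, not MotionDirector's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B(A(x)), with W frozen and only A, B trained."""
    def __init__(self, base: nn.Linear, r=4, alpha=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)        # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

# Dual-path usage (conceptual): wrap the spatial attention projections with one
# set of LoRALinear modules and train them on single frames (appearance path);
# wrap the temporal attention projections with a second set and train them on
# full clips (motion path). Pairing the motion LoRAs with different appearance
# LoRAs then mixes the motion of one video with the look of another.
```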
arXiv Detail & Related papers (2023-10-12T16:26:18Z) - Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
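A rough sketch of the pre-training recipe: mask spatio-temporal patch "tubes" and ask a predictor to recover a motion signal at the masked positions. Here frame-to-frame patch differences stand in for the paper's trajectory features, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def tube_mask(num_patches, num_frames, mask_ratio=0.75):
    """Mask the same spatial patches in every frame (a 'tube'). True = visible."""
    keep = torch.rand(num_patches).argsort() >= int(mask_ratio * num_patches)
    return keep.unsqueeze(0).expand(num_frames, -1)       # (T, N)

def motion_targets(patch_feats):
    """patch_feats: (T, N, D). Target for frame t is the change since frame t-1,
    a crude proxy for position and shape change."""
    return patch_feats[1:] - patch_feats[:-1]             # (T-1, N, D)

def mme_loss(predictor, encoded_visible, visible_mask, targets):
    """Score the predicted motion only at positions the encoder never saw."""
    pred = predictor(encoded_visible)                     # (T-1, N, D), hypothetical predictor
    hidden = ~visible_mask[1:]                            # masked positions
    return F.mse_loss(pred[hidden], targets[hidden])
```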
arXiv Detail & Related papers (2022-10-12T11:19:55Z) - Video2StyleGAN: Disentangling Local and Global Variations in a Video [68.70889857355678]
StyleGAN has emerged as a powerful paradigm for facial editing, providing disentangled controls over age, expression, illumination, etc.
We introduce Video2StyleGAN that takes a target image and driving video(s) to reenact the local and global locations and expressions from the driving video in the identity of the target image.
arXiv Detail & Related papers (2022-05-27T14:18:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.