SAVE: Protagonist Diversification with Structure Agnostic Video Editing
- URL: http://arxiv.org/abs/2312.02503v1
- Date: Tue, 5 Dec 2023 05:13:20 GMT
- Title: SAVE: Protagonist Diversification with Structure Agnostic Video Editing
- Authors: Yeji Song, Wonsik Shin, Junsoo Lee, Jeesoo Kim and Nojun Kwak
- Abstract summary: Previous works usually work well on trivial and consistent shapes but easily collapse on difficult targets whose body shapes differ largely from the original.
We propose motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly.
We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow.
- Score: 29.693364686494274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Driven by the rapid progress of text-to-image (T2I) generation models, text-to-video (T2V) generation has advanced significantly as well. Accordingly, tasks such as modifying an object or changing the style of a video have become possible. However, previous works usually work well on trivial and consistent shapes but easily collapse on a difficult target whose body shape differs largely from the original one. In this paper, we identify a bias problem in existing video editing methods that restricts the range of choices for the new protagonist, and we attempt to address this issue using a conventional image-level personalization method. We adopt motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly. To deal with the natural discrepancy between images and videos, we propose a motion word with an inflated textual embedding to properly represent the motion in a source video. We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow, efficiently computed from the pre-calculated attention maps. Finally, we decouple the motion from the appearance of the source video with an additional pseudo word. Extensive experiments demonstrate the editing capability of our method, taking a step toward more diverse and extensive video editing.
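The abstract gives no implementation details, so the following is only a rough, hypothetical sketch of the two ingredients it describes: an inflated textual embedding for the motion word (a shared base embedding plus per-frame offsets) and a pseudo optical flow derived from pre-calculated attention maps. The function names, tensor shapes, embedding dimension, and the centroid-based flow approximation are all assumptions for illustration, not the authors' implementation.

```python
import torch


def inflate_motion_embedding(base_token: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Hypothetical sketch: expand a single motion-word embedding of shape [D]
    into a per-frame ("inflated") embedding of shape [num_frames, D] as a
    shared base plus learnable per-frame offsets."""
    deltas = torch.zeros(num_frames, base_token.shape[-1], requires_grad=True)
    return base_token.unsqueeze(0) + deltas  # [F, D]


def pseudo_flow_from_attention(attn: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: derive a coarse pseudo optical flow from
    pre-computed per-frame spatial attention maps of shape [F, H, W].
    Each map is normalized, its attention-weighted centroid is taken, and
    frame-to-frame centroid displacements are returned. A real method would
    produce a dense flow field; this is only an illustrative stand-in."""
    num_frames, H, W = attn.shape
    ys = torch.linspace(0.0, 1.0, H).view(1, H, 1)
    xs = torch.linspace(0.0, 1.0, W).view(1, 1, W)
    w = attn / attn.flatten(1).sum(dim=1).view(-1, 1, 1).clamp_min(1e-8)
    cy = (w * ys).flatten(1).sum(dim=1)        # [F] vertical centroids
    cx = (w * xs).flatten(1).sum(dim=1)        # [F] horizontal centroids
    centroids = torch.stack([cx, cy], dim=-1)  # [F, 2]
    return centroids[1:] - centroids[:-1]      # [F-1, 2] per-frame displacement


if __name__ == "__main__":
    motion_word = torch.randn(768)                 # assumed CLIP-like embedding dim
    inflated = inflate_motion_embedding(motion_word, num_frames=8)
    attn_maps = torch.rand(8, 32, 32)              # stand-in for cached attention maps
    flow = pseudo_flow_from_attention(attn_maps)
    print(inflated.shape, flow.shape)              # [8, 768], [7, 2]
```

In the paper's setting, such a pseudo flow would only serve as a signal for steering the motion word's attention toward moving regions; the actual formulation used by the authors may differ.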
Related papers
- DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency [66.49423641279374]
We introduce DeCo, a novel video editing framework specifically designed to treat humans and the background as separate editable targets.
We propose a decoupled dynamic human representation that utilizes a human-body prior to generate tailored humans.
We extend the calculation of score distillation sampling into normal space and image space to enhance the texture of humans during the optimization.
arXiv Detail & Related papers (2024-08-14T11:53:40Z)
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose to use a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z)
- I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
- ReVideo: Remake a Video with Motion and Content Control [67.5923127902463]
We present a novel attempt to Remake a Video (ReVideo), which allows precise video editing in specific areas through the specification of both content and motion.
ReVideo addresses a new task involving the coupling and training imbalance between content and motion control.
Our method can also seamlessly extend these applications to multi-area editing without modifying specific training, demonstrating its flexibility and robustness.
arXiv Detail & Related papers (2024-05-22T17:46:08Z)
- Slicedit: Zero-Shot Video Editing With Text-to-Image Diffusion Models Using Spatio-Temporal Slices [19.07572422897737]
We present Slicedit, a method for text-based video editing that utilizes a pretrained T2I diffusion model to process both spatial and temporal slices.
Our method generates videos that retain the structure and motion of the original video while adhering to the target text.
arXiv Detail & Related papers (2024-05-20T17:55:56Z)
- MagicStick: Controllable Video Editing via Control Handle Transformations [109.26314726025097]
MagicStick is a controllable video editing method that edits video properties by applying transformations to the extracted internal control signals.
We present experiments on numerous examples within our unified framework.
We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared to previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z)
- MotionEditor: Editing Video Motion via Content-Aware Diffusion [96.825431998349]
MotionEditor is a diffusion model for video motion editing.
It incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence.
arXiv Detail & Related papers (2023-11-30T18:59:33Z)
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z)
- Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame.
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing it with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)