Related papers: Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

URL: http://arxiv.org/abs/2403.14617v3
Date: Thu, 24 Oct 2024 20:10:30 GMT
Title: Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
Authors: Xiang Fan, Anand Bhattad, Ranjay Krishna
Abstract summary: Videoshop is a training-free video editing algorithm for localized semantic edits. It allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc., with fine-grained control over location and appearance. Videoshop produces higher-quality edits than 6 baselines on 2 editing benchmarks, as measured by 10 evaluation metrics.
Score: 19.969947635371
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We introduce Videoshop, a training-free video editing algorithm for localized semantic edits. Videoshop allows users to use any editing software, including Photoshop and generative inpainting, to modify the first frame; it automatically propagates those changes, with semantic, spatial, and temporally consistent motion, to the remaining frames. Unlike existing methods that enable edits only through imprecise textual instructions, Videoshop allows users to add or remove objects, semantically change objects, insert stock photos into videos, etc. with fine-grained control over locations and appearance. We achieve this through image-based video editing by inverting latents with noise extrapolation, from which we generate videos conditioned on the edited image. Videoshop produces higher quality edits against 6 baselines on 2 editing benchmarks using 10 evaluation metrics.

Related papers

LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning [8.077442711429317]
Video editing using diffusion models has achieved remarkable results in generating high-quality edits for videos. First-frame-guided editing provides control over the first frame, but lacks flexibility over subsequent frames. We propose a mask-based LoRA tuning method that adapts pretrained Image-to-Video (I2V) models for flexible video editing.
arXiv Detail & Related papers (2025-06-11T18:03:55Z)
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z)
Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions [49.14827857853878]
ReimaginedAct comprises video understanding, reasoning, and editing modules. Our method can accept not only direct instructional text prompts but also "what if" questions to predict possible action changes.
arXiv Detail & Related papers (2024-03-11T22:46:46Z)
UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [28.140945021777878]
We present UniEdit, a tuning-free framework that supports both video motion and appearance editing. To realize motion editing while preserving source video content, we introduce auxiliary motion-reference and reconstruction branches. The obtained features are then injected into the main editing path via temporal and spatial self-attention layers.
arXiv Detail & Related papers (2024-02-20T17:52:12Z)
MagicStick: Controllable Video Editing via Control Handle Transformations [49.29608051543133]
MagicStick is a controllable video editing method that edits video properties by applying transformations to extracted internal control signals. We present experiments on numerous examples within our unified framework. We also compare with shape-aware text-based editing and handcrafted motion video generation, demonstrating superior temporal consistency and editing capability compared with previous works.
arXiv Detail & Related papers (2023-12-05T17:58:06Z)
MotionEditor: Editing Video Motion via Content-Aware Diffusion [96.825431998349]
MotionEditor is a diffusion model for video motion editing. It incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence.
arXiv Detail & Related papers (2023-11-30T18:59:33Z)
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame. In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z)
Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
Shape-aware Text-driven Layered Video Editing [39.56765973770167]
We present a shape-aware, text-driven video editing method to handle shape changes. We first propagate the deformation field between the input and edited keyframe to all frames. We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
arXiv Detail & Related papers (2023-01-30T18:41:58Z)
EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high quality, high precision semantic image editing. We show that EditGAN can manipulate images with an unprecedented level of detail and freedom. We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.