MVOC: a training-free multiple video object composition method with diffusion models
- URL: http://arxiv.org/abs/2406.15829v1
- Date: Sat, 22 Jun 2024 12:18:46 GMT
- Title: MVOC: a training-free multiple video object composition method with diffusion models
- Authors: Wei Wang, Yaosen Chen, Yuegen Liu, Qi Yuan, Shubin Yang, Yanru Zhang
- Abstract summary: We propose a Multiple Video Object Composition (MVOC) method based on diffusion models.
We first perform DDIM inversion on each video object to obtain the corresponding noise features.
Secondly, we combine and edit each object by image editing methods to obtain the first frame of the composited video.
- Score: 10.364986401722625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video composition is a core task of video editing. Although diffusion-based image composition has been highly successful, extending it to video object composition is not straightforward: the composited video must not only exhibit plausible interaction effects between objects but also keep each object's motion and identity consistent, which is necessary for a physically harmonious result. To address this challenge, we propose a Multiple Video Object Composition (MVOC) method based on diffusion models. Specifically, we first perform DDIM inversion on each video object to obtain the corresponding noise features. Second, we combine and edit the objects with image editing methods to obtain the first frame of the composited video. Finally, we use an image-to-video generation model to composite the video with feature and attention injections in the Video Object Dependence Module, a training-free conditional guidance operation for video generation that coordinates the features and attention maps of objects that may be non-independent in the composited video. The resulting generative model not only constrains the objects in the generated video to remain consistent with the original object motion and identity, but also introduces interaction effects between objects. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art approaches. Project page: https://sobeymil.github.io/mvoc.com.
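The abstract outlines a three-stage pipeline whose first step is DDIM inversion of each video object. As a hedged illustration of that step only, the sketch below implements the standard deterministic DDIM inversion update that maps a clean latent to its noise feature. The `eps_model` noise predictor, the latent shapes, and the step schedule are assumptions for illustration; the paper's actual networks and its Video Object Dependence Module are not reproduced here.

```python
# Minimal sketch of DDIM inversion (step 1 of the MVOC pipeline as described
# in the abstract). Assumptions: `eps_model(x, t)` is some pretrained noise
# predictor and `alphas_cumprod` holds the diffusion schedule's cumulative
# alpha-bar values; neither is taken from the paper.
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    """Map a clean latent x0 to an approximate x_T ("noise feature").

    Runs the deterministic DDIM update in the forward (noising) direction:
    at each step, the model's noise estimate is used to move the latent one
    schedule step closer to pure noise.
    """
    T = len(alphas_cumprod)
    step_ids = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = step_ids[i], step_ids[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)                                  # predicted noise at step t
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # DDIM estimate of x0
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # step toward noise
    return x  # approximate x_T for this video object's latent
```

In the full method, the inverted noise features of each object would then condition the image-to-video generation of the composited clip via the feature and attention injections described above; that guidance step is specific to the paper and is omitted here.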
Related papers
- HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner.
The first stage focuses on object swapping in a single frame with HOI awareness.
The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
- Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model.
We ensure that the edited information will be consistent across all the video frames.
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
- Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing [46.56615725175025]
We introduce Edit-Your-Motion, a video motion editing method that tackles unseen challenges through one-shot fine-tuning.
To effectively decouple the motion and appearance of the source video, we design a temporal two-stage learning strategy.
With Edit-Your-Motion, users can edit the motion of humans in the source video, creating more engaging and diverse content.
arXiv Detail & Related papers (2024-05-07T17:06:59Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
- WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction [82.79642869586587]
WALDO is a novel approach to the prediction of future video frames from past ones.
Individual images are decomposed into multiple layers combining object masks and a small set of control points.
The layer structure is shared across all frames in each video to build dense inter-frame connections.
arXiv Detail & Related papers (2022-11-25T18:59:46Z)
- Layered Neural Atlases for Consistent Video Editing [37.69447642502351]
We present a method that decomposes, or "unwraps", an input video into a set of layered 2D atlases.
For each pixel in the video, our method estimates its corresponding 2D coordinate in each of the atlases.
We design our atlases to be interpretable and semantic, which facilitates easy and intuitive editing in the atlas domain.
arXiv Detail & Related papers (2021-09-23T14:58:59Z)
- Omnimatte: Associating Objects and Their Effects in Video [100.66205249649131]
Scene effects related to objects in video are typically overlooked by computer vision.
In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video.
Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic: it produces omnimattes automatically for arbitrary objects and a variety of effects.
arXiv Detail & Related papers (2021-05-14T17:57:08Z)
- First Order Motion Model for Image Animation [90.712718329677]
Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video.
Our framework addresses this problem without using any annotation or prior information about the specific object to animate.
arXiv Detail & Related papers (2020-02-29T07:08:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.