VideoSwap: Customized Video Subject Swapping with Interactive Semantic
Point Correspondence
- URL: http://arxiv.org/abs/2312.02087v2
- Date: Tue, 5 Dec 2023 17:14:25 GMT
- Title: VideoSwap: Customized Video Subject Swapping with Interactive Semantic
Point Correspondence
- Authors: Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao,
Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Tang
- Abstract summary: Video editing approaches that rely on dense correspondences are ineffective when the target edit involves a shape change.
We introduce the VideoSwap framework, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape.
Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
- Score: 37.85691662157054
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current diffusion-based video editing primarily focuses on
structure-preserved editing by utilizing various dense correspondences to
ensure temporal consistency and motion alignment. However, these approaches are
often ineffective when the target edit involves a shape change. To embark on
video editing with shape change, we explore customized video subject swapping
in this work, where we aim to replace the main subject in a source video with a
target subject having a distinct identity and potentially different shape. In
contrast to previous methods that rely on dense correspondences, we introduce
the VideoSwap framework that exploits semantic point correspondences, inspired
by our observation that only a small number of semantic points are necessary to
align the subject's motion trajectory and modify its shape. We also introduce
various user-point interactions (e.g., removing points and dragging points) to
handle various cases of semantic point correspondence. Extensive experiments
demonstrate state-of-the-art video subject swapping results across a variety of
real-world videos.
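For intuition, sparse semantic point trajectories can act as a lightweight motion condition. The sketch below is a minimal, hypothetical illustration (not the official VideoSwap code): it rasterizes user-selected point trajectories into per-frame Gaussian heatmaps that a diffusion UNet could consume as extra conditioning channels. The point count, resolution, and sigma are illustrative assumptions.

```python
# Hypothetical sketch: turn sparse semantic point trajectories into per-frame
# heatmaps usable as conditioning maps. Not the official VideoSwap code.
import torch

def points_to_heatmaps(trajectories: torch.Tensor, height: int, width: int,
                       sigma: float = 4.0) -> torch.Tensor:
    """trajectories: (T, K, 2) normalized (x, y) in [0, 1] for T frames, K points.
    Returns (T, K, H, W) Gaussian heatmaps, one channel per semantic point."""
    T, K, _ = trajectories.shape
    ys = torch.arange(height).view(1, 1, height, 1).float()
    xs = torch.arange(width).view(1, 1, 1, width).float()
    px = trajectories[..., 0].view(T, K, 1, 1) * (width - 1)
    py = trajectories[..., 1].view(T, K, 1, 1) * (height - 1)
    d2 = (xs - px) ** 2 + (ys - py) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

# Example: 16 frames, 4 user-selected semantic points (e.g., nose, ears, tail tip).
traj = torch.rand(16, 4, 2)
cond = points_to_heatmaps(traj, 64, 64)   # (16, 4, 64, 64) conditioning maps
print(cond.shape)
```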
Related papers
- HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner.
The first stage focuses on object swapping in a single frame with HOI awareness.
The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
- Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model.
We ensure that the edited information will be consistent across all the video frames.
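Cross-frame ("extended") attention of this kind is often implemented by letting each frame's queries attend to the keys and values of all frames jointly. The snippet below is a generic sketch of that idea, not this paper's exact implementation; the tensor shapes and single-head formulation are assumptions.

```python
# Hedged sketch of extended (cross-frame) self-attention for temporally
# consistent diffusion editing. Not the implementation from this paper.
import torch

def extended_attention(q, k, v):
    """q, k, v: (T, N, C) per-frame patch tokens. Keys/values are shared across frames."""
    T, N, C = q.shape
    k_all = k.reshape(1, T * N, C).expand(T, T * N, C)   # every frame sees all frames
    v_all = v.reshape(1, T * N, C).expand(T, T * N, C)
    attn = torch.softmax(q @ k_all.transpose(1, 2) / C ** 0.5, dim=-1)  # (T, N, T*N)
    return attn @ v_all                                   # (T, N, C)
```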
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
- Drag-A-Video: Non-rigid Video Editing with Point-based Interaction [63.78538355189017]
We propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video.
Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video.
To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video.
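A rough sketch of point-based motion supervision in this drag-editing spirit (not the official Drag-A-Video code) is given below: diffusion features at each handle point are pulled one small step toward the corresponding target point. The feature-sampling helper, step size, and L1 objective are illustrative assumptions.

```python
# Hedged sketch of drag-style motion supervision on diffusion features.
# Not the official Drag-A-Video implementation.
import torch
import torch.nn.functional as F

def sample_feat(feat: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
    """Bilinearly sample feat (1, C, H, W) at pixel coordinates xy = (x, y)."""
    _, _, h, w = feat.shape
    gx = xy[0] / (w - 1) * 2 - 1
    gy = xy[1] / (h - 1) * 2 - 1
    grid = torch.stack([gx, gy]).view(1, 1, 1, 2)
    return F.grid_sample(feat, grid, align_corners=True).view(-1)

def motion_supervision_loss(feat, handles, targets, step=2.0):
    """feat: (1, C, H, W) UNet features; handles/targets: lists of (x, y) tensors."""
    loss = feat.new_zeros(())
    for p, t in zip(handles, targets):
        d = t - p
        d = d / (d.norm() + 1e-6) * step          # one small step toward the target
        src = sample_feat(feat.detach(), p)        # reference feature (no gradient)
        dst = sample_feat(feat, p + d)             # feature at the shifted location
        loss = loss + (src - dst).abs().mean()
    return loss
```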
arXiv Detail & Related papers (2023-12-05T18:05:59Z)
- SAVE: Protagonist Diversification with Structure Agnostic Video Editing [29.693364686494274]
Previous works usually perform well on simple, consistent shapes, but easily collapse on difficult targets whose body shape differs greatly from the original.
We propose motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly.
We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow.
arXiv Detail & Related papers (2023-12-05T05:13:20Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance.
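As a rough illustration of flow-guided attention (not the official FLATTEN code), the sketch below restricts attention so that patch tokens may only attend to tokens lying on the same optical-flow path; the precomputed path labels and the single-head formulation are assumptions.

```python
# Hedged sketch of flow-path-restricted attention: tokens attend only to tokens
# on the same optical-flow path. path_ids is assumed to be precomputed from flow.
import torch

def flow_path_attention(tokens: torch.Tensor, path_ids: torch.Tensor) -> torch.Tensor:
    """tokens: (T, N, C) patch features per frame; path_ids: (T, N) integer path labels."""
    T, N, C = tokens.shape
    q = k = v = tokens.reshape(T * N, C)
    same_path = path_ids.reshape(-1, 1) == path_ids.reshape(1, -1)  # (T*N, T*N) mask
    attn = (q @ k.t()) / C ** 0.5
    attn = attn.masked_fill(~same_path, float("-inf"))
    out = torch.softmax(attn, dim=-1) @ v
    return out.reshape(T, N, C)
```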
arXiv Detail & Related papers (2023-10-09T17:59:53Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)