DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
- URL: http://arxiv.org/abs/2508.14465v1
- Date: Wed, 20 Aug 2025 06:40:34 GMT
- Title: DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
- Authors: Weitao Wang, Zichen Wang, Hongdeng Shen, Yulei Lu, Xirui Fan, Suhui Wu, Jun Zhang, Haoqian Wang, Hao Zhang
- Abstract summary: We propose a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video according to a user-specified mask and reference image.
Our DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators.
- Score: 22.47601749326567
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the rapid progress of video generation, demand for customized video editing is surging; subject swapping constitutes a key component of such editing yet remains under-explored. Prevailing swapping approaches either specialize in narrow domains--such as human-body animation or hand-object interaction--or rely on indirect editing paradigms or ambiguous text prompts that compromise final fidelity. In this paper, we propose DreamSwapV, a mask-guided, subject-agnostic, end-to-end framework that swaps any subject in any video according to a user-specified mask and reference image. To inject fine-grained guidance, we introduce multiple conditions and a dedicated condition fusion module that integrates them efficiently. In addition, an adaptive mask strategy is designed to accommodate subjects of varying scales and attributes, further improving interactions between the swapped subject and its surrounding context. Through our carefully designed two-phase dataset construction and training scheme, DreamSwapV outperforms existing methods, as validated by comprehensive experiments on VBench indicators and our newly introduced DreamSwapV-Benchmark.
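The abstract names two concrete mechanisms: a condition fusion module that merges the video, mask, and reference-image conditions, and an adaptive mask strategy for subjects of varying scales. The paper's implementation is not reproduced here; the sketch below is a minimal, hypothetical PyTorch reading of those two ideas, where the class names, tensor layout, and dilation heuristic are all assumptions for illustration.

```python
# Hypothetical sketch of mask-guided condition fusion (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionFusion(nn.Module):
    """Fuse noisy video latents, a binary subject mask, and a reference embedding."""

    def __init__(self, latent_dim: int = 16, ref_dim: int = 768):
        super().__init__()
        # Project the reference-image embedding to the latent channel width.
        self.ref_proj = nn.Linear(ref_dim, latent_dim)
        # A 1x1x1 conv merges latents + mask + broadcast reference into one tensor.
        self.merge = nn.Conv3d(latent_dim * 2 + 1, latent_dim, kernel_size=1)

    def forward(self, latents, mask, ref_emb):
        # latents: (B, C, T, H, W); mask: (B, 1, T, H, W); ref_emb: (B, ref_dim)
        b, c, t, h, w = latents.shape
        ref = self.ref_proj(ref_emb).view(b, c, 1, 1, 1).expand(b, c, t, h, w)
        fused = torch.cat([latents, mask, ref], dim=1)  # (B, 2C+1, T, H, W)
        return self.merge(fused)

def adaptive_mask(mask: torch.Tensor, subject_area_frac: float) -> torch.Tensor:
    """Dilate small-subject masks more aggressively (one plausible reading of
    the 'adaptive mask strategy'): max-pooling a float mask acts as dilation."""
    k = 9 if subject_area_frac < 0.05 else 3  # heuristic kernel size (assumption)
    return F.max_pool3d(mask, kernel_size=(1, k, k), stride=1,
                        padding=(0, k // 2, k // 2))
```

Under this reading, small subjects receive a more aggressively dilated mask, giving the model room to synthesize contact shadows and other subject-context interactions around the swap region.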
Related papers
- DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer [21.788582116033684]
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video.
Existing methods struggle to maintain identity similarity and preserve attributes while keeping temporal consistency.
We propose a comprehensive framework that transfers the strengths of image face swapping to the video domain.
arXiv Detail & Related papers (2026-01-04T08:07:11Z)
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling.
We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z)
- UNIC: Unified In-Context Video Editing [76.76077875564526]
UNified In-Context Video Editing (UNIC) is a framework that unifies diverse video editing tasks within a single model in an in-context manner.
We introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly differentiate different editing tasks.
Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
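The summary names "task-aware RoPE" without specifying it. One plausible reading is to shift rotary positions by a per-task offset so tokens from different editing tasks occupy disjoint positional ranges; the sketch below illustrates only that reading, and the offset scheme and function names are assumptions, not UNIC's actual formulation.

```python
# Hypothetical illustration of a "task-aware" rotary position embedding.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0):
    # positions: (seq,) -> angles: (seq, dim/2); dim must be even.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_task_aware_rope(x: torch.Tensor, task_id: int, task_span: int = 4096):
    # x: (seq, dim). Shift positions by task_id * task_span so each task's
    # tokens get a disjoint positional range (an assumed offset scheme).
    seq, dim = x.shape
    pos = torch.arange(seq) + task_id * task_span
    ang = rope_angles(pos, dim)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]  # rotate interleaved channel pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```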
arXiv Detail & Related papers (2025-06-04T17:57:43Z)
- MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement [47.064467920954776]
We introduce MAGREF, a unified and effective framework for any-reference video generation.
Our approach incorporates masked guidance and a subject disentanglement mechanism.
Experiments on a comprehensive benchmark demonstrate that MAGREF consistently outperforms existing state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-29T17:58:15Z)
- Insert Anything: Image Insertion via In-Context Editing in DiT [19.733787045511775]
We present a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance.
Our approach is trained once on our new AnyInsertion dataset--comprising 120K prompt-image pairs covering diverse tasks such as person, object, and garment insertion--and effortlessly generalizes to a wide range of insertion scenarios.
arXiv Detail & Related papers (2025-04-21T10:19:12Z)
- HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner.
The first stage focuses on object swapping in a single frame with HOI awareness.
The second stage extends the single-frame edit across the entire sequence.
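As a schematic of the two-stage pipeline just described, the sketch below shows only the control flow; the stage functions are placeholders standing in for HOI-Swap's learned models, and all names are hypothetical.

```python
# Schematic of the two-stage pipeline; stage bodies are placeholders, not models.
import numpy as np

def stage1_single_frame_swap(frame: np.ndarray, ref: np.ndarray) -> np.ndarray:
    # Placeholder for the HOI-aware single-frame editing model.
    return frame

def stage2_propagate(video: np.ndarray, edited: np.ndarray, idx: int) -> np.ndarray:
    # Placeholder for propagating the anchor-frame edit across the sequence;
    # here we only splice the edited frame back in.
    out = video.copy()
    out[idx] = edited
    return out

def swap_object_in_video(video: np.ndarray, ref: np.ndarray, idx: int = 0) -> np.ndarray:
    edited = stage1_single_frame_swap(video[idx], ref)
    return stage2_propagate(video, edited, idx)
```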
arXiv Detail & Related papers (2024-06-11T22:31:29Z)
- SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing [51.857176097841915]
SwapAnything is a novel framework that can swap any object in an image with personalized concepts given by a reference image.
It has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image.
arXiv Detail & Related papers (2024-04-08T17:52:29Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
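A sequence-level selection mechanism of this kind can be made concrete with a simple consistency score: rate each flow-predicted mask by how well it agrees with its temporal neighbors and keep the top scorers as exemplars. The sketch below is an illustrative simplification (the paper's actual scoring uses appearance and flow cues); all names are hypothetical.

```python
# Hypothetical exemplar selection via neighbor-agreement scoring.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def select_exemplars(masks: list[np.ndarray], top_k: int = 3) -> list[int]:
    """Return indices of the top_k per-frame masks most consistent with
    their temporal neighbors (a proxy for proposal accuracy)."""
    scores = []
    for t in range(len(masks)):
        neigh = [masks[u] for u in (t - 1, t + 1) if 0 <= u < len(masks)]
        scores.append(np.mean([mask_iou(masks[t], m) for m in neigh]) if neigh else 0.0)
    return sorted(range(len(masks)), key=lambda t: scores[t], reverse=True)[:top_k]
```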
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence [37.85691662157054]
Video editing approaches that rely on dense correspondences are ineffective when the target edit involves a shape change.
We introduce the VideoSwap framework, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape.
Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
arXiv Detail & Related papers (2023-12-04T17:58:06Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)