IMAGEdit: Let Any Subject Transform
- URL: http://arxiv.org/abs/2510.01186v1
- Date: Wed, 01 Oct 2025 17:59:56 GMT
- Title: IMAGEdit: Let Any Subject Transform
- Authors: Fei Shen, Weihao Xu, Rui Yan, Dong Zhang, Xiangbo Shu, Jinhui Tang,
- Abstract summary: IMAGEdit is a training-free framework for video subject editing with any number of subjects.
It manipulates the appearances of multiple designated subjects while preserving non-target regions.
It is compatible with any mask-driven video generation model.
- Score: 61.666509860041124
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present IMAGEdit, a training-free framework for video subject editing with any number of subjects. It manipulates the appearances of multiple designated subjects while preserving non-target regions, without finetuning or retraining. We achieve this by providing robust multimodal conditioning and precise mask sequences through a prompt-guided multimodal alignment module and a prior-based mask retargeting module. We first leverage large models' understanding and generation capabilities to produce multimodal information and mask motion sequences for multiple subjects across various types. The obtained prior mask sequences are then fed into a pretrained mask-driven video generation model to synthesize the edited video. With strong generalization capability, IMAGEdit remedies insufficient prompt-side multimodal conditioning and overcomes mask boundary entanglement in videos with any number of subjects, thereby significantly expanding the applicability of video editing. More importantly, IMAGEdit is compatible with any mask-driven video generation model, significantly improving overall performance. Extensive experiments on our newly constructed multi-subject benchmark MSVBench verify that IMAGEdit consistently surpasses state-of-the-art methods. Code, models, and datasets are publicly available at https://github.com/XWH-A/IMAGEdit.
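The abstract describes a two-stage pipeline: a prompt-guided multimodal alignment module and a prior-based mask retargeting module produce conditioning and per-subject mask sequences, which are then fed to a pluggable mask-driven video generator. The following is a minimal illustrative sketch of that data flow; every function name, signature, and data shape here is a hypothetical stand-in, not the authors' actual API.

```python
# Illustrative sketch of the two-stage IMAGEdit data flow described above.
# All names and shapes are assumptions for demonstration only.

def align_multimodal_prompts(video_frames, subject_prompts):
    # Stage 1a (prompt-guided multimodal alignment): large models turn the
    # user's edit prompts into per-subject conditioning (stubbed as dicts).
    return [{"subject": p, "condition": f"cond({p})"} for p in subject_prompts]

def retarget_masks(video_frames, conditions):
    # Stage 1b (prior-based mask retargeting): one mask sequence per
    # designated subject, keeping overlapping boundaries disentangled.
    return {c["subject"]: [f"mask({c['subject']},t={t})"
                           for t in range(len(video_frames))]
            for c in conditions}

def imagedit(video_frames, subject_prompts, generator):
    # Stage 2: feed the prior mask sequences into any pretrained
    # mask-driven video generation model; `generator` is pluggable,
    # reflecting the claimed model-agnostic compatibility.
    conditions = align_multimodal_prompts(video_frames, subject_prompts)
    masks = retarget_masks(video_frames, conditions)
    return [generator(frame, conditions, {s: seq[t] for s, seq in masks.items()})
            for t, frame in enumerate(video_frames)]
```

Because the generator is passed in as a callable, any mask-driven backbone could be dropped into stage 2 without touching stage 1, which is the compatibility property the abstract emphasizes.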
Related papers
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation [60.66986667921744]
iMontage is a unified framework designed to repurpose a powerful video model into an all-in-one image generator.
We propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm.
This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors.
arXiv Detail & Related papers (2025-11-25T18:54:16Z)
- Follow-Your-Creation: Empowering 4D Creation through Video Inpainting [47.08187788419001]
Follow-Your-Creation is a framework capable of generating and editing 4D content from a single monocular video input.
By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task.
arXiv Detail & Related papers (2025-06-05T03:11:48Z)
- MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks.
We present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z)
- BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm.
We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model.
Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- Portrait Video Editing Empowered by Multimodal Generative Priors [39.747581584889495]
We introduce PortraitGen, a powerful portrait video editing method that achieves consistent and expressive stylization with multimodal prompts.
Our approach incorporates multimodal inputs through knowledge distilled from large-scale 2D generative models.
Our system also incorporates expression similarity guidance and a face-aware portrait editing module, effectively mitigating degradation issues associated with iterative dataset updates.
arXiv Detail & Related papers (2024-09-20T15:45:13Z)
- Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
Model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z)
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z)
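The MaskINT entry above describes a two-stage recipe: zero-shot editing of a few keyframes with a T2I diffusion model, followed by non-autoregressive masked generation of the intermediate frames. The sketch below illustrates only that control flow; the function names and the nearest-keyframe "interpolation" are hypothetical stubs, not the paper's actual method or API.

```python
# Illustrative stub of the two-stage keyframe-edit-then-interpolate recipe.
# Names and the trivial interpolation rule are assumptions for demonstration.

def edit_keyframes(frames, prompt):
    # Stage 1: zero-shot edit of a few keyframes (here: first and last)
    # with a pretrained T2I diffusion model, stubbed as string tagging.
    return {i: f"edited({frames[i]},{prompt})" for i in (0, len(frames) - 1)}

def interpolate_frames(frames, keyframe_edits):
    # Stage 2: non-autoregressive masked generation fills in frames between
    # edited keyframes; stubbed as copying the nearest keyframe's edit.
    keys = sorted(keyframe_edits)
    out = []
    for i in range(len(frames)):
        nearest = min(keys, key=lambda k: abs(k - i))
        out.append(keyframe_edits.get(i, f"interp({keyframe_edits[nearest]})"))
    return out
```

Splitting the task this way confines the expensive diffusion step to a handful of keyframes, which is the efficiency argument the two-stage decomposition rests on.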
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences arising from its use.