MIVE: New Design and Benchmark for Multi-Instance Video Editing
- URL: http://arxiv.org/abs/2412.12877v1
- Date: Tue, 17 Dec 2024 13:00:04 GMT
- Title: MIVE: New Design and Benchmark for Multi-Instance Video Editing
- Authors: Samuel Teodoro, Agus Gunawan, Soo Ye Kim, Jihyong Oh, Munchurl Kim
- Abstract summary: MIVE is a general-purpose mask-based framework, not dedicated to specific objects (e.g., people). MIVE introduces two key modules: Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing. Our extensive qualitative, quantitative, and user study evaluations demonstrate that MIVE significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention.
- Score: 27.706882926164724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent AI-based video editing has enabled users to edit videos through simple text prompts, significantly simplifying the editing process. However, recent zero-shot video editing techniques primarily focus on global or single-object edits, which can lead to unintended changes in other parts of the video. When multiple objects require localized edits, existing methods face challenges, such as unfaithful editing, editing leakage, and lack of suitable evaluation datasets and metrics. To overcome these limitations, we propose a zero-shot $\textbf{M}$ulti-$\textbf{I}$nstance $\textbf{V}$ideo $\textbf{E}$diting framework, called MIVE. MIVE is a general-purpose mask-based framework, not dedicated to specific objects (e.g., people). MIVE introduces two key modules: (i) Disentangled Multi-instance Sampling (DMS) to prevent editing leakage and (ii) Instance-centric Probability Redistribution (IPR) to ensure precise localization and faithful editing. Additionally, we present our new MIVE Dataset featuring diverse video scenarios and introduce the Cross-Instance Accuracy (CIA) Score to evaluate editing leakage in multi-instance video editing tasks. Our extensive qualitative, quantitative, and user study evaluations demonstrate that MIVE significantly outperforms recent state-of-the-art methods in terms of editing faithfulness, accuracy, and leakage prevention, setting a new benchmark for multi-instance video editing. The project page is available at https://kaist-viclab.github.io/mive-site/
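The abstract does not spell out how the Cross-Instance Accuracy (CIA) Score is computed. As a rough, hypothetical illustration of the leakage idea only (not the authors' metric), the sketch below scores each edited instance region against every instance's target caption and counts how often a region matches its own caption best; the similarity matrix is assumed to come from an external vision-language model such as CLIP, and all names are made up.

```python
# Conceptual sketch only (not the authors' code): a CIA-style leakage check.
# Assumption: region_text_sim[i, j] holds the similarity (e.g., from CLIP,
# averaged over frames) between the edited region of instance i and the
# target caption of instance j. An edit counts as leakage-free when
# instance i is most similar to its own caption (the diagonal entry).
import numpy as np

def cia_style_score(region_text_sim: np.ndarray) -> float:
    """Fraction of instances whose edited region matches its own caption best."""
    n = region_text_sim.shape[0]
    assert region_text_sim.shape == (n, n), "expects one caption per instance"
    correct = region_text_sim.argmax(axis=1) == np.arange(n)
    return float(correct.mean())

# Toy example with 3 instances: instance 2's region leaks toward caption 0.
sim = np.array([[0.31, 0.12, 0.10],
                [0.15, 0.28, 0.11],
                [0.30, 0.14, 0.22]])
print(cia_style_score(sim))  # ~0.667 -> two of three edits are leakage-free
```

In this toy setup, a row whose maximum falls off the diagonal signals that an edit intended for one instance has leaked into another instance's region.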
Related papers
- MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs [82.34547399693966]
Existing methods for lifelong model editing compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory. MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits.
arXiv Detail & Related papers (2025-06-09T16:16:42Z) - Resolving UnderEdit & OverEdit with Iterative & Neighbor-Assisted Model Editing [7.752740499342269]
Large Language Models (LLMs) are widely deployed in downstream tasks, but keeping their knowledge up-to-date via retraining or fine-tuning is often computationally expensive. Model editing provides a more efficient alternative by updating a targeted subset of parameters, which often follows the locate-and-edit paradigm. We propose two complementary methods: iterative model editing, which applies successive edits to mitigate UnderEdit, and neighbor-assisted model editing, which incorporates neighboring knowledge during editing to reduce OverEdit.
arXiv Detail & Related papers (2025-03-14T21:53:12Z) - V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes [29.80140472486948]
V$2$Edit is a training-free framework for instruction-guided video and 3D scene editing.
We introduce a progressive strategy that decomposes complex editing tasks into simpler subtasks.
We extend V$2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits.
arXiv Detail & Related papers (2025-03-13T17:59:55Z) - Re-Attentional Controllable Video Diffusion Editing [48.052781838711994]
We propose a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. To align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose Re-Attentional Diffusion (RAD). RAD refocuses the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video.
arXiv Detail & Related papers (2024-12-16T12:32:21Z) - A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model [10.736207095604414]
We propose a two-stage scheme for general editing. First, unlike previous works that extract scene-specific features, we leverage a pre-trained Vision-Language Model (VLM).
We also propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions.
arXiv Detail & Related papers (2024-11-07T18:20:28Z) - GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models.
Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z) - Action Reimagined: Text-to-Pose Video Editing for Dynamic Human Actions [49.14827857853878]
ReimaginedAct comprises video understanding, reasoning, and editing modules.
Our method can accept not only direct instructional text prompts but also 'what if' questions to predict possible action changes.
arXiv Detail & Related papers (2024-03-11T22:46:46Z) - LoMOE: Localized Multi-Object Editing via Multi-Diffusion [8.90467024388923]
We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process.
Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions.
A combination of cross-attention and background losses within the latent space ensures that the characteristics of the object being edited are preserved.
arXiv Detail & Related papers (2024-03-01T10:46:47Z) - The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse [58.0132400208411]
Even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks.
Benchmarking Large Language Models after each edit is impractically time-consuming and resource-intensive.
We have utilized GPT-3.5 to develop a new dataset, HardEdit, based on hard cases.
arXiv Detail & Related papers (2024-02-15T01:50:38Z) - DUnE: Dataset for Unified Editing [3.7346004746366384]
We introduce DUnE-an editing benchmark where edits are natural language sentences.
We show that retrieval-augmented language modeling can outperform specialized editing techniques.
arXiv Detail & Related papers (2023-11-27T18:56:14Z) - Object-aware Inversion and Reassembly for Image Editing [61.19822563737121]
We propose Object-aware Inversion and Reassembly (OIR) to enable object-level fine-grained editing.
We use our search metric to find the optimal inversion step for each editing pair when editing an image.
Our method achieves superior performance in editing object shapes, colors, materials, categories, etc., especially in multi-object editing scenarios.
arXiv Detail & Related papers (2023-10-18T17:59:02Z) - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z) - EditGAN: High-Precision Semantic Image Editing [120.49401527771067]
EditGAN is a novel method for high quality, high precision semantic image editing.
We show that EditGAN can manipulate images with an unprecedented level of detail and freedom.
We can also easily combine multiple edits and perform plausible edits beyond EditGAN training data.
arXiv Detail & Related papers (2021-11-04T22:36:33Z)