LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization
- URL: http://arxiv.org/abs/2512.02933v2
- Date: Wed, 03 Dec 2025 03:50:34 GMT
- Title: LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization
- Authors: Zhihan Xiao, Lin Liu, Yixin Gao, Xiaopeng Zhang, Haoxuan Che, Songping Mai, Qi Tian,
- Abstract summary: LoVoRA is a novel framework for mask-free video object removal and addition.<n>Our approach integrates image-to-video translation, optical flow-based mask propagation, and videopainting, enabling temporally consistent edits.<n>LoVoRA achieves end-to-end video editing without requiring external control signals during inference.
- Score: 49.945233586949286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA. https://cz-5f.github.io/LoVoRA.github.io
Related papers
- LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents [61.91651123290512]
LangDriveCTRL is a framework for editing real-world driving videos to synthesize diverse traffic scenarios.<n>It supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction.
arXiv Detail & Related papers (2025-12-19T10:57:03Z) - EasyV2V: A High-quality Instruction-based Video Editing Framework [108.78294392167017]
captionemphEasyV2V is a framework for instruction-based video editing.<n>EasyV2V works with flexible inputs, e.g., video+text, video+mask+reference+, and state-of-the-art video editing results.
arXiv Detail & Related papers (2025-12-18T18:59:57Z) - DirectDrag: High-Fidelity, Mask-Free, Prompt-Free Drag-based Image Editing via Readout-Guided Feature Alignment [21.951152436940536]
Drag-based image editing using generative models provides intuitive control over image structures.<n>Existing methods rely heavily on manually provided masks and textual prompts to preserve semantic fidelity and motion precision.<n>We propose DirectDrag, a novel mask- and prompt-free editing framework.
arXiv Detail & Related papers (2025-12-03T17:12:00Z) - Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding [45.593989778240655]
A proposed representation achieves high video reconstruction accuracy with fewer parameters.<n>It supports complex video processing tasks, including video in-painting and temporally consistent video editing.
arXiv Detail & Related papers (2025-10-14T08:05:30Z) - MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks.<n>We present MAKIMA, a tuning-free MAE framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z) - Blended Latent Diffusion under Attention Control for Real-World Video Editing [5.659933808910005]
We propose to adapt a image-level blended latent diffusion model to perform local video editing tasks.
Specifically, we leverage DDIM inversion to acquire the latents as background latents instead of the randomly noised ones.
We also introduce an autonomous mask manufacture mechanism derived from cross-attention maps in diffusion steps.
arXiv Detail & Related papers (2024-09-05T13:23:52Z) - Disentangling spatio-temporal knowledge for weakly supervised object detection and segmentation in surgical video [10.287675722826028]
This paper introduces Video Spatio-Temporal Disment Networks (VDST-Net) to disentangle information using semi-decoupled temporal knowledge distillation to predict high-quality class activation maps (CAMs)
We demonstrate the efficacy of our framework on a public reference dataset and on a more challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames.
arXiv Detail & Related papers (2024-07-22T16:52:32Z) - MotionBooth: Motion-Aware Customized Text-to-Video Generation [44.41894050494623]
MotionBooth is a framework designed for animating customized subjects with precise control over both object and camera movements.
We efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately.
Our approach presents subject region loss and video preservation loss to enhance the subject's learning performance.
arXiv Detail & Related papers (2024-06-25T17:42:25Z) - MotionEditor: Editing Video Motion via Content-Aware Diffusion [96.825431998349]
MotionEditor is a diffusion model for video motion editing.
It incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence.
arXiv Detail & Related papers (2023-11-30T18:59:33Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution VOIN jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a newoc-temporal YouTube attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.