Object-WIPER : Training-Free Object and Associated Effect Removal in Videos
- URL: http://arxiv.org/abs/2601.06391v1
- Date: Sat, 10 Jan 2026 02:28:31 GMT
- Title: Object-WIPER : Training-Free Object and Associated Effect Removal in Videos
- Authors: Saksham Singh Kushwaha, Sayan Nag, Yapeng Tian, Kuldeep Kulkarni,
- Abstract summary: We introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos. We localize relevant visual tokens via visual-text cross-attention and visual self-attention. Experiments on DAVIS and a newly curated real-world associated-effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines on the proposed metric.
- Score: 41.50266704357095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy the background token values saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated-effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines on this metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
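The abstract describes a concrete sequence of tensor-level operations: fusing an attention-derived effect mask with the user mask, re-noising only the foreground tokens, pasting back background latents saved during inversion, and scoring removal with three similarity terms. The sketch below illustrates these steps in PyTorch under assumed tensor shapes; the helper names (fuse_masks, preserve_background, removal_score), the thresholding scheme, and the equal metric weights are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the released code) of the mask-fusion,
# background-preserving denoising, and three-term removal metric
# described in the abstract. Shapes and thresholds are assumptions.
import torch
import torch.nn.functional as F


def fuse_masks(cross_attn, user_mask, thresh=0.5):
    """Fuse an attention-derived effect mask with the user object mask.

    cross_attn: (T, N) visual-text cross-attention weights for the effect
                query tokens, averaged over heads and queries (assumed).
    user_mask:  (T, N) binary mask of the target object tokens.
    Returns a (T, N) binary foreground-token mask.
    """
    lo = cross_attn.amin(dim=-1, keepdim=True)
    hi = cross_attn.amax(dim=-1, keepdim=True)
    attn = (cross_attn - lo) / (hi - lo + 1e-8)   # per-frame min-max normalize
    effect_mask = (attn > thresh).float()
    return torch.clamp(effect_mask + user_mask, max=1.0)


def preserve_background(denoised, inverted, fg_mask):
    """Copy background token values saved during inversion back into the
    current denoised latents, leaving only foreground tokens free to change."""
    fg = fg_mask.unsqueeze(-1)                    # (T, N, 1), broadcast over channels
    return fg * denoised + (1.0 - fg) * inverted


def removal_score(in_fg, out_fg, out_bg, w=(1.0, 1.0, 1.0)):
    """Three-term removal metric (illustrative equal weighting):
    1) temporal consistency of output foreground tokens across frames,
    2) coherence between output foreground and background tokens per frame,
    3) dissimilarity between input and output foreground tokens.
    in_fg/out_fg: (T, Nf, C) foreground tokens; out_bg: (T, Nb, C)."""
    temporal = F.cosine_similarity(out_fg[:-1], out_fg[1:], dim=-1).mean()
    coherence = F.cosine_similarity(out_fg.mean(1), out_bg.mean(1), dim=-1).mean()
    dissim = 1.0 - F.cosine_similarity(in_fg, out_fg, dim=-1).mean()
    return w[0] * temporal + w[1] * coherence + w[2] * dissim


if __name__ == "__main__":
    T, N, C = 8, 256, 64                          # frames, tokens/frame, channels
    cross_attn = torch.rand(T, N)
    user_mask = (torch.rand(T, N) > 0.9).float()
    fg_mask = fuse_masks(cross_attn, user_mask)

    inverted = torch.randn(T, N, C)               # latents saved during inversion
    denoised = torch.randn(T, N, C)               # latents at a denoising step
    latents = preserve_background(denoised, inverted, fg_mask)
    print(latents.shape, fg_mask.mean().item())

    in_fg, out_fg = torch.randn(T, 32, C), torch.randn(T, 32, C)
    out_bg = torch.randn(T, N - 32, C)
    print(removal_score(in_fg, out_fg, out_bg).item())
```

In the actual method, the effect mask would come from the DiT's cross- and self-attention maps and the latents from its inversion pass; the toy tensors here only demonstrate the masking and scoring logic.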
Related papers
- Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction [47.01100029571904]
We study the task of establishing object-level visual correspondence across different viewpoints in videos, focusing on the challenging egocentric-to-exocentric and exocentric-to-egocentric scenarios. We propose a simple yet effective framework based on conditional binary segmentation, where an object query mask is encoded into a latent representation to guide the localization of the corresponding object in a target video. Experiments on the Ego-Exo4D and HANDAL-X benchmarks demonstrate the effectiveness of our optimization objective and TTT strategy, achieving state-of-the-art performance.
arXiv Detail & Related papers (2026-02-22T00:53:03Z) - EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal [10.015328934927062]
We propose EraseLoRA, a dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. The first component, Background-aware Foreground Exclusion (BFE), uses a multimodal large language model to separate the target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision. The second, Background-aware Reconstruction with Subtype Aggregation (BRSA), performs test-time optimization that treats inferred background subtypes as complementary pieces.
arXiv Detail & Related papers (2025-12-25T07:34:38Z) - Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT). MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z) - Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - Text2LIVE: Text-Driven Layered Image and Video Editing [13.134513605107808]
We present a method for zero-shot, text-driven appearance manipulation in natural images and videos.
Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects.
We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.
arXiv Detail & Related papers (2022-04-05T21:17:34Z) - Learning To Segment Dominant Object Motion From Watching Videos [72.57852930273256]
We envision a simple framework for dominant moving object segmentation that neither requires annotated data to train nor relies on saliency priors or pre-trained optical flow maps.
Inspired by a layered image representation, we introduce a technique to group pixel regions according to their affine parametric motion.
This enables our network to learn segmentation of the dominant foreground object using only RGB image pairs as input for both training and inference.
arXiv Detail & Related papers (2021-11-28T14:51:00Z) - VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z) - Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)