EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal
- URL: http://arxiv.org/abs/2512.21545v1
- Date: Thu, 25 Dec 2025 07:34:38 GMT
- Title: EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal
- Authors: Sanghyun Jo, Donghwan Lee, Eunji Jung, Seong Je Oh, Kyungsu Kim
- Abstract summary: We propose EraseLoRA, a dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats inferred background subtypes as complementary pieces.
- Score: 10.015328934927062
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE) uses a multimodal large language model to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision, producing reliable background cues while excluding distractors. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA) performs test-time optimization that treats inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, preserving local detail and global structure without explicit attention intervention. We validate EraseLoRA as a plug-in to pretrained diffusion models and across benchmarks for object removal, demonstrating consistent improvements over dataset-free baselines and competitive results against dataset-driven methods. The code will be made available upon publication.
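To make the two-stage recipe concrete, here is a minimal PyTorch sketch of the test-time adaptation idea behind BRSA. Everything in it is an assumption for illustration (the toy backbone, the LoRA wrapper, and the background-restricted reconstruction loss); the paper's code is not yet released.

```python
# Illustrative test-time LoRA adaptation with a background-restricted
# reconstruction loss. All names and the toy backbone are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Toy stand-in for a pretrained backbone (a per-pixel MLP, not a diffusion UNet).
backbone = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 3))
backbone[0] = LoRALinear(backbone[0])
backbone[2] = LoRALinear(backbone[2])

image = torch.rand(1, 64, 64, 3)                     # H x W x RGB in [0, 1]
bg_mask = (torch.rand(1, 64, 64, 1) > 0.3).float()   # 1 = reliable background

opt = torch.optim.Adam([p for p in backbone.parameters() if p.requires_grad], lr=1e-3)
for _ in range(100):                          # test-time optimization loop
    pred = backbone(image)
    # Supervise only known-background pixels so the erased foreground
    # never teaches the adapter to regenerate the object.
    loss = (F.mse_loss(pred, image, reduction="none") * bg_mask).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Zero-initializing the up-projection makes the adapter start as an identity update, so the pretrained behavior is preserved and only the background-supervised gradient moves the weights.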
Related papers
- Object-WIPER: Training-Free Object and Associated Effect Removal in Videos [41.50266704357095]
We introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos. We localize relevant visual tokens via visual-text cross-attention and visual self-attention. Experiments on DAVIS and a newly curated real-world associated-effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines on the evaluated metrics.
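A rough sketch of how the described token localization could work; the function, the thresholding scheme, and the shapes are assumptions rather than the paper's implementation:

```python
# Hypothetical cross-attention seeding followed by self-attention expansion.
import torch

def localize_tokens(visual, text, self_attn, tau=0.5):
    """visual: (N, d) video tokens; text: (T, d) prompt tokens;
    self_attn: (N, N) visual self-attention averaged over heads."""
    # Cross-attention relevance of each visual token to the prompt.
    rel = torch.softmax(visual @ text.T / visual.shape[-1] ** 0.5, dim=0).max(dim=1).values
    seed = rel > rel.quantile(tau)            # seed tokens for the target object
    # Expand the seeds through self-attention to catch associated effects
    # (shadows, reflections) that co-attend with the object.
    expanded = self_attn[:, seed].sum(dim=1)
    return seed | (expanded > expanded.quantile(tau))

N, T, d = 256, 4, 64
vis, txt = torch.randn(N, d), torch.randn(T, d)
attn = torch.softmax(torch.randn(N, N), dim=-1)
mask = localize_tokens(vis, txt, attn)        # boolean mask over visual tokens
```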
arXiv Detail & Related papers (2026-01-10T02:28:31Z)
- Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection [38.14795337940857]
Domain shift reduces the detector's ability to maintain strong object-focused representations. FALCON-SFOD is a framework designed to enhance object-focused adaptation under domain shift.
arXiv Detail & Related papers (2025-12-19T12:30:29Z)
- Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance [36.23578004588688]
We propose Foreground-Aware Slot Attention (FASA), a two-stage framework that separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. Experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods.
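A minimal sketch of the masked slot-attention step as the summary describes it (slot count, masking scheme, and update rule are assumed details):

```python
# Slot 0 is restricted to pseudo-background pixels; slots 1..K-1 compete
# only over foreground pixels. Details are assumptions, not FASA's code.
import torch

def masked_slot_attention(feats, slots, fg_mask, iters=3):
    """feats: (N, d) pixel features; slots: (K, d); fg_mask: (N,) bool."""
    d = feats.shape[-1]
    for _ in range(iters):
        logits = slots @ feats.T / d ** 0.5          # (K, N)
        logits[0, fg_mask] = float("-inf")           # background slot: bg pixels only
        logits[1:, ~fg_mask] = float("-inf")         # object slots: fg pixels only
        attn = torch.softmax(logits, dim=0)          # slots compete per pixel
        attn = attn / attn.sum(dim=1, keepdim=True).clamp_min(1e-8)
        slots = attn @ feats                         # weighted-mean slot update
    return slots, attn

N, K, d = 1024, 5, 64
feats, slots = torch.randn(N, d), torch.randn(K, d)
fg = torch.rand(N) > 0.7                             # pseudo-mask guidance
slots, attn = masked_slot_attention(feats, slots, fg)
```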
arXiv Detail & Related papers (2025-12-02T12:14:05Z)
- ObjectClear: Complete Object Removal via Object-Effect Attention [56.2893552300215]
We introduce a new dataset for OBject-Effect Removal, named OBER, which provides paired images with and without object effects, along with precise masks for both objects and their associated visual artifacts. We propose a novel framework, ObjectClear, which incorporates an object-effect attention mechanism to guide the model toward the foreground removal regions by learning attention masks. Experiments demonstrate that ObjectClear outperforms existing methods, achieving improved object-effect removal quality and background fidelity, especially in complex scenarios.
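One plausible reading of how a learned object-effect attention map can preserve background fidelity is attention-guided blending; the sketch below is illustrative, not ObjectClear's released code:

```python
# Blend generator output and input using the object-effect attention map:
# trust the generator where the object and its effects were, copy the
# input elsewhere. Names and shapes are illustrative.
import torch

def attention_guided_fusion(generated, original, effect_attn):
    """generated/original: (3, H, W); effect_attn: (1, H, W) in [0, 1]."""
    return effect_attn * generated + (1.0 - effect_attn) * original

gen = torch.rand(3, 64, 64)
orig = torch.rand(3, 64, 64)
attn = torch.rand(1, 64, 64)      # high where the object and its effects were
out = attention_guided_fusion(gen, orig, attn)
```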
arXiv Detail & Related papers (2025-05-28T17:51:17Z)
- Mitigating Context Bias in Domain Adaptation for Object Detection using Mask Pooling [1.1060425537315088]
Context bias refers to the association between the foreground objects and background during the object detection training process. We provide a causal view of the context bias, pointing towards the pooling operation in the convolutional network architecture as the possible source of this bias. We present an alternative, Mask Pooling, which uses an additional input of foreground masks to separate the pooling process in the respective foreground and background regions.
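A minimal implementation of mask pooling as described, i.e. pooling foreground and background features separately so neither contaminates the other (shapes are illustrative):

```python
# Masked average pooling: one vector per region instead of one mixed vector.
import torch

def mask_pool(feats, fg_mask, eps=1e-6):
    """feats: (B, C, H, W); fg_mask: (B, 1, H, W) in {0, 1}."""
    fg = (feats * fg_mask).sum(dim=(2, 3)) / (fg_mask.sum(dim=(2, 3)) + eps)
    bg = (feats * (1 - fg_mask)).sum(dim=(2, 3)) / ((1 - fg_mask).sum(dim=(2, 3)) + eps)
    return fg, bg                     # (B, C) each, foreground/background kept apart

feats = torch.randn(2, 256, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()
fg_vec, bg_vec = mask_pool(feats, mask)
```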
arXiv Detail & Related papers (2025-05-24T01:05:20Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
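The interpolation step might look like the following sketch; the linear schedule and the early-step cutoff are assumptions, not DiffUHaul's exact schedule:

```python
# Fade from source to target attention features during early denoising
# steps so the new layout inherits the original appearance.
import torch

def blend_attention(src_feat, tgt_feat, step, num_steps, early_frac=0.3):
    """Linearly interpolate during the early fraction of steps,
    then use the target features unchanged."""
    cutoff = int(num_steps * early_frac)
    if step >= cutoff:
        return tgt_feat
    w = step / max(cutoff, 1)                 # 0 -> source, 1 -> target
    return (1 - w) * src_feat + w * tgt_feat

src = torch.randn(8, 77, 64)                  # e.g. per-head attention features
tgt = torch.randn(8, 77, 64)
for t in range(50):
    feat = blend_attention(src, tgt, t, num_steps=50)
```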
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection [55.48770333927732]
We propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection.
It consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor.
Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach.
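A hedged sketch of the feature-space stage implied by the pre-trained feature extractor: embed the input and its diffusion reconstruction with a frozen backbone and score per-pixel feature distance (the backbone choice and the distance are assumptions):

```python
# Anomaly map from feature-space comparison of input vs. reconstruction.
# weights=None keeps the sketch offline; a pretrained backbone would be
# used in practice.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

extractor = resnet18(weights=None).eval()
layers = torch.nn.Sequential(*list(extractor.children())[:6])  # through layer2

@torch.no_grad()
def anomaly_map(image, reconstruction):
    f_in, f_rec = layers(image), layers(reconstruction)
    # Cosine distance per spatial location, upsampled to image resolution.
    d = 1 - F.cosine_similarity(f_in, f_rec, dim=1, eps=1e-6)   # (B, h, w)
    return F.interpolate(d.unsqueeze(1), size=image.shape[-2:], mode="bilinear")

img = torch.rand(1, 3, 256, 256)
rec = torch.rand(1, 3, 256, 256)   # stand-in for the SG-network reconstruction
amap = anomaly_map(img, rec)
```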
arXiv Detail & Related papers (2023-12-11T18:38:28Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
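As a rough illustration of the foreground-prediction-map paradigm, one can penalize the classification activation that survives after masking out the predicted foreground; this simplified loss is my reading, not the paper's exact formulation:

```python
# Suppress the ground-truth class activation computed from background-only
# features. All names and the specific loss form are assumed.
import torch

def bas_loss(feats, fg_map, classifier, label):
    """feats: (B, C, H, W); fg_map: (B, 1, H, W) in [0, 1]."""
    bg_feats = feats * (1 - fg_map)               # keep only background
    bg_logits = classifier(bg_feats.mean(dim=(2, 3)))
    # Push the background-derived activation of the true class toward zero.
    return bg_logits.gather(1, label[:, None]).sigmoid().mean()

feats = torch.randn(2, 64, 14, 14)
fg = torch.rand(2, 1, 14, 14)                     # predicted foreground map
clf = torch.nn.Linear(64, 10)
label = torch.tensor([3, 7])
loss = bas_loss(feats, fg, clf, label)
```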
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- ILSGAN: Independent Layer Synthesis for Unsupervised Foreground-Background Segmentation [49.61394755739333]
Unsupervised foreground-background segmentation aims at extracting salient objects from cluttered backgrounds.
We propose a simple-yet-effective explicit layer independence modeling approach, termed Independent Layer Synthesis GAN (ILSGAN).
Our ILSGAN achieves strong state-of-the-art generation quality and segmentation performance on complex real-world data.
arXiv Detail & Related papers (2022-11-25T09:35:46Z)
- Self-Supervised Video Object Segmentation via Cutout Prediction and Tagging [117.73967303377381]
We propose a novel self-supervised Video Object Segmentation (VOS) approach that strives to achieve better object-background discriminability.
Our approach is based on a discriminative learning loss formulation that takes into account both object and background information.
Our proposed approach, CT-VOS, achieves state-of-the-art results on two challenging benchmarks: DAVIS-2017 and YouTube-VOS.
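A simple form such a discriminative loss could take: supervise both object and background pixels of the predicted cutout mask with per-region weights (the weighting scheme is an assumed detail, not CT-VOS's published loss):

```python
# Weighted per-pixel BCE over a predicted cutout mask, covering both the
# object region and the background rather than the object alone.
import torch
import torch.nn.functional as F

def cutout_loss(pred_mask, gt_cutout, obj_weight=1.0, bg_weight=1.0):
    """pred_mask: (B, 1, H, W) logits; gt_cutout: (B, 1, H, W) in {0, 1}."""
    per_pixel = F.binary_cross_entropy_with_logits(pred_mask, gt_cutout, reduction="none")
    w = obj_weight * gt_cutout + bg_weight * (1 - gt_cutout)
    return (w * per_pixel).mean()

pred = torch.randn(2, 1, 64, 64)
gt = (torch.rand(2, 1, 64, 64) > 0.8).float()
loss = cutout_loss(pred, gt)
```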
arXiv Detail & Related papers (2022-04-22T17:53:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.