Marmot: Object-Level Self-Correction via Multi-Agent Reasoning
- URL: http://arxiv.org/abs/2504.20054v3
- Date: Fri, 15 Aug 2025 03:38:28 GMT
- Title: Marmot: Object-Level Self-Correction via Multi-Agent Reasoning
- Authors: Jiayang Sun, Hongbo Wang, Jie Cao, Huaibo Huang, Ran He
- Abstract summary: Marmot is a novel and generalizable framework that leverages Multi-Agent Reasoning for Multi-Object Self-Correcting. Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
- Score: 55.74860093731475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While diffusion models excel at generating high-quality images, they often struggle with accurate counting, attributes, and spatial relationships in complex multi-object scenes. One potential solution involves employing Multimodal Large Language Model (MLLM) as an AI agent to construct a self-correction framework. However, these approaches heavily rely on the capabilities of the MLLMs used, often fail to account for all objects within the image, and suffer from cumulative distortions during multi-round editing processes. To address these challenges, we propose Marmot, a novel and generalizable framework that leverages Multi-Agent Reasoning for Multi-Object Self-Correcting to enhance image-text alignment. First, we employ a large language model as an Object-Aware Agent to perform object-level divide-and-conquer, automatically decomposing self-correction tasks into object-centric subtasks based on image descriptions. For each subtask, we construct an Object Correction System featuring a decision-execution-verification mechanism that operates exclusively on a single object's segmentation mask or the bounding boxes of object pairs, effectively mitigating inter-object interference and enhancing editing reliability. To efficiently integrate correction results from subtasks while avoiding cumulative distortions from multi-stage editing, we propose a Pixel-Domain Stitching Smoother, which employs mask-guided two-stage latent space optimization. This innovation enables parallel processing of subtasks, significantly improving runtime efficiency while preventing distortion accumulation. Extensive experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
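The pipeline described in the abstract — an Object-Aware Agent decomposing the correction task into per-object subtasks, each handled by a decision-execution-verification loop before the results are stitched back together — can be sketched in plain Python. This is a minimal illustration of the control flow only; the function names (`decompose`, `correct_object`, `verify`, `edit`, `marmot_pipeline`) and the trivial string-based stand-ins for the LLM agent, diffusion editor, and verifier are hypothetical, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    object_name: str
    instruction: str


def decompose(description: str) -> list[Subtask]:
    # Hypothetical stand-in for the Object-Aware Agent: one
    # object-centric subtask per comma-separated phrase in the prompt.
    return [Subtask(obj, f"verify and correct '{obj}'")
            for obj in description.split(", ")]


def verify(obj: str) -> bool:
    # Placeholder verifier: flags objects tagged as "broken".
    return not obj.startswith("broken ")


def edit(obj: str) -> str:
    # Placeholder editor operating only on this object's region.
    return obj.removeprefix("broken ")


def correct_object(task: Subtask, max_rounds: int = 3) -> str:
    # Decision-execution-verification loop, restricted to a single
    # object so edits cannot interfere with other objects.
    result = task.object_name
    for _ in range(max_rounds):
        if verify(result):   # decision: does this object match the prompt?
            break
        result = edit(result)  # execution on the masked region only
    return result


def marmot_pipeline(description: str) -> list[str]:
    # Subtasks are independent, so in the real system they can run in
    # parallel before a stitching stage merges the results into one image.
    return [correct_object(t) for t in decompose(description)]


print(marmot_pipeline("two cats, broken red ball, a chair"))
# prints ['two cats', 'red ball', 'a chair']
```

The point of the per-object loop is isolation: because each verification and edit sees only one object, a failed correction cannot distort its neighbors, and the final stitch (the paper's Pixel-Domain Stitching Smoother) is what reconciles the independent results.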
Related papers
- Model Merging in the Essential Subspace [78.5390284258307]
Model merging aims to integrate multiple task-specific fine-tuned models into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. We propose ESM (Essential Subspace Merging), a robust framework for effective model merging.
arXiv Detail & Related papers (2026-02-23T00:33:38Z) - Hierarchical Scheduling for Multi-Vector Image Retrieval [17.023146933530484]
HiMIR is an efficient scheduling framework for image retrieval. We introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Our empirical study shows that HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
arXiv Detail & Related papers (2025-10-10T03:36:18Z) - Objective Soups: Multilingual Multi-Task Modeling for Speech Processing [69.52720282028385]
Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks. This paper investigates three multi-objective MSP formulations, which we refer to as objective soup recipes. Our work demonstrates that hierarchical MOO is a more effective and scalable approach for building state-of-the-art MSP models.
arXiv Detail & Related papers (2025-08-12T07:01:09Z) - MDE-Edit: Masked Dual-Editing for Multi-Object Image Editing via Diffusion Models [10.798205956644317]
We propose a training-free, inference-stage optimization approach that enables precise localized image manipulation in complex multi-object scenes, named MDE-Edit. Extensive experiments demonstrate that MDE-Edit outperforms state-of-the-art methods in editing accuracy and visual quality, offering a robust solution for complex multi-object image manipulation tasks.
arXiv Detail & Related papers (2025-05-08T10:01:14Z) - Balancing Task-invariant Interaction and Task-specific Adaptation for Unified Image Fusion [82.74585945197231]
Unified image fusion aims to integrate complementary information from multi-source images, enhancing image quality. Existing general image fusion methods incorporate explicit task identification to enable adaptation to different fusion tasks. We propose a novel unified image fusion framework named "TITA", which balances Task-invariant Interaction and Task-specific Adaptation.
arXiv Detail & Related papers (2025-04-07T15:08:35Z) - HOMER: Homography-Based Efficient Multi-view 3D Object Removal [25.832938786291358]
3D object removal is an important sub-task in 3D scene editing, with broad applications in scene understanding, augmented reality, and robotics. Existing methods struggle to achieve a desirable balance among consistency, usability, and computational efficiency in multi-view settings. We propose a novel pipeline that improves the quality and efficiency of multi-view object mask generation and inpainting.
arXiv Detail & Related papers (2025-01-29T13:12:06Z) - EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation [30.93060152004132]
Learning to manipulate objects from high-dimensional observations presents significant challenges. Recent approaches have utilized large-scale offline data to train models from pixel observations. We propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer.
arXiv Detail & Related papers (2024-12-25T13:50:15Z) - COMO: Cross-Mamba Interaction and Offset-Guided Fusion for Multimodal Object Detection [9.913133285133998]
Single-modal object detection tasks often experience performance degradation when encountering diverse scenarios. Multimodal object detection tasks can offer more comprehensive information about object features by integrating data from various modalities. In this paper, we propose a novel approach called the CrOss-Mamba interaction and Offset-guided fusion framework.
arXiv Detail & Related papers (2024-12-24T01:14:48Z) - AdapMTL: Adaptive Pruning Framework for Multitask Learning Model [5.643658120200373]
AdapMTL is an adaptive pruning framework for multitask models.
It balances sparsity allocation and accuracy performance across multiple tasks.
It showcases superior performance compared to state-of-the-art pruning methods.
arXiv Detail & Related papers (2024-08-07T17:19:15Z) - Multi-Expert Adaptive Selection: Task-Balancing for All-in-One Image Restoration [20.04384107349706]
We propose a multi-expert adaptive selection mechanism for multi-task image restoration.
The scheme adaptively selects the most suitable expert from the expert library according to the content of the input image and the prompts of the current task.
Experimental results demonstrate that our proposed method is both effective and superior to existing approaches.
arXiv Detail & Related papers (2024-07-27T01:13:07Z) - GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing [60.09562648953926]
GenArtist is a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent.
We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution.
Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-07-08T04:30:53Z) - Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
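The feature-interpolation idea described above — blending attention features between the source and target images during the early denoising steps — can be illustrated with a minimal sketch. This is an assumption-laden toy, not DiffUHaul's actual code: the function name `interpolate_attention` and the linear schedule for the blend weight are hypothetical, and real attention features would be high-dimensional tensors rather than flat lists.

```python
def interpolate_attention(src: list[float], tgt: list[float],
                          step: int, early_steps: int) -> list[float]:
    # Early-denoising fusion: the blend weight alpha moves from the
    # source features (original appearance) toward the target features
    # (new layout) as the denoising step index grows, then stays at 1.0
    # once the early phase is over.
    alpha = min(step / early_steps, 1.0)
    return [(1 - alpha) * s + alpha * t for s, t in zip(src, tgt)]


# A quarter of the way through the early phase, the blend is 25% target.
print(interpolate_attention([0.0, 0.0], [1.0, 1.0], step=1, early_steps=4))
# prints [0.25, 0.25]
```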
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing [22.855660721387167]
We transform the spatial-aware image editing task into a combination of two sub-tasks: multi-layered latent decomposition and multi-layered latent fusion.
We show that our approach consistently surpasses the latest spatial editing methods, including Self-Guidance and DiffEditor.
arXiv Detail & Related papers (2024-03-21T15:35:42Z) - LoMOE: Localized Multi-Object Editing via Multi-Diffusion [8.90467024388923]
We introduce a novel framework for zero-shot localized multi-object editing through a multi-diffusion process.
Our approach leverages foreground masks and corresponding simple text prompts that exert localized influences on the target regions.
A combination of cross-attention and background losses within the latent space ensures that the characteristics of the object being edited are preserved.
arXiv Detail & Related papers (2024-03-01T10:46:47Z) - Consolidating Attention Features for Multi-view Image Editing [126.19731971010475]
We focus on spatial control-based geometric manipulations and introduce a method to consolidate the editing process across various views.
We introduce QNeRF, a neural radiance field trained on the internal query features of the edited images.
We refine the process through a progressive, iterative method that better consolidates queries across the diffusion timesteps.
arXiv Detail & Related papers (2024-02-22T18:50:18Z) - MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion [81.7514869897233]
We develop a training-free Multimodal-LLM agent (MuLan), as a human painter, that can progressively generate multi-object.
MuLan harnesses a large language model (LLM) to decompose a prompt to a sequence of sub-tasks, each generating only one object by stable diffusion.
MuLan also adopts a vision-language model (VLM) to provide feedback to the image generated in each sub-task and control the diffusion model to re-generate the image if it violates the original prompt.
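The progressive generate-and-verify loop sketched in this summary — an LLM decomposing the prompt into per-object sub-tasks, each generated in turn and checked by a VLM, with regeneration on failure — follows a simple control flow, shown below as a hedged toy. The function names (`llm_decompose`, `generate_object`, `vlm_ok`, `mulan_loop`) and the string-based stand-ins for the LLM planner, diffusion generator, and VLM critic are hypothetical placeholders, not MuLan's actual interfaces.

```python
def llm_decompose(prompt: str) -> list[str]:
    # Hypothetical planner: MuLan uses an LLM; here we just split on
    # "and" so each sub-task adds exactly one object.
    return [p.strip() for p in prompt.split(" and ")]


def generate_object(canvas: list[str], obj: str) -> list[str]:
    # Stand-in for one conditioned diffusion pass that paints one object
    # onto the partially generated image.
    return canvas + [obj]


def vlm_ok(canvas: list[str], obj: str) -> bool:
    # Stand-in for the VLM critic: did the new object actually appear?
    return obj in canvas


def mulan_loop(prompt: str, max_retries: int = 2) -> list[str]:
    canvas: list[str] = []
    for obj in llm_decompose(prompt):
        for _ in range(max_retries + 1):
            candidate = generate_object(canvas, obj)
            if vlm_ok(candidate, obj):
                canvas = candidate  # accept and move to the next object
                break
            # otherwise: regenerate this object, keeping earlier ones fixed
    return canvas


print(mulan_loop("a dog and a red hat"))
# prints ['a dog', 'a red hat']
```

The key design choice this mirrors is sequential composition: because each sub-task only adds one object and earlier objects are frozen, a regeneration triggered by the critic cannot undo previously verified content.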
arXiv Detail & Related papers (2024-02-20T06:14:30Z) - Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z) - Learning Enriched Features for Fast Image Restoration and Enhancement [166.17296369600774]
This paper presents a holistic goal of maintaining spatially-precise high-resolution representations through the entire network.
We learn an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
Our approach achieves state-of-the-art results for a variety of image processing tasks, including defocus deblurring, image denoising, super-resolution, and image enhancement.
arXiv Detail & Related papers (2022-04-19T17:59:45Z) - Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks.
We present a novel architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.