Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance
- URL: http://arxiv.org/abs/2509.17757v1
- Date: Mon, 22 Sep 2025 13:20:06 GMT
- Title: Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance
- Authors: Hongxing Fan, Lipeng Wang, Haohua Chen, Zehuan Huang, Jiangtao Wu, Lu Sheng
- Abstract summary: Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. We propose a Collaborative Multi-Agent Reasoning Framework, based on upfront collaborative reasoning, to overcome the limitations of prior methods.
- Score: 17.81116161163605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.
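The abstract outlines a two-stage flow: upfront collaborative reasoning (occlusion analysis, boundary expansion, fine-grained description) followed by a single guided synthesis step. As a rough illustration only, since the paper's actual agent interfaces are not given here, all function names, mask representations, and the prompt format below are hypothetical, the control flow might be sketched as:

```python
# Hypothetical sketch of the "upfront collaborative reasoning" stage.
# Masks are modeled as sets of (x, y) pixel coordinates purely for
# illustration; none of these names come from the paper.

def expand_mask(visible, occluders, margin=1):
    """Agents jointly estimate where hidden parts may lie: keep the
    visible region plus any occluder pixel within `margin` of it."""
    near = {
        (x + dx, y + dy)
        for (x, y) in visible
        for dx in range(-margin, margin + 1)
        for dy in range(-margin, margin + 1)
    }
    hidden = set().union(*occluders) & near if occluders else set()
    return visible | hidden

def describe_target(label, attributes):
    """Language agent: a fine-grained prompt that names the target and
    explicitly excludes occluders, so the inpainter does not regenerate
    them inside a large inpainting area."""
    return f"a complete {label}, {', '.join(attributes)}, no occluding objects"

def reason_before_inpainting(visible, occluders, label, attributes):
    """Upfront reasoning yields a precise mask and a text prompt; a single
    inpainting call (not shown) would then synthesize the layered RGBA
    object directly."""
    mask = expand_mask(visible, occluders)
    prompt = describe_target(label, attributes)
    return mask, prompt

visible = {(0, 0), (0, 1), (1, 0)}          # visible pixels of the target
occluders = [{(1, 1), (2, 2)}]              # one occluder; (1, 1) abuts the target
mask, prompt = reason_before_inpainting(visible, occluders, "mug", ["ceramic", "blue"])
```

The point of the sketch is the ordering the abstract emphasizes: the mask and the semantic guidance are both fixed *before* generation, so there is no progressive pipeline for errors to accumulate in.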
Related papers
- CtrlFuse: Mask-Prompt Guided Controllable Infrared and Visible Image Fusion [51.060328159429154]
Infrared and visible image fusion generates all-weather perception-capable images by combining complementary modalities. We propose CtrlFuse, a controllable image fusion framework that enables interactive dynamic fusion guided by mask prompts. Experiments demonstrate state-of-the-art results in both fusion controllability and segmentation accuracy, with the adapted task branch even outperforming the original segmentation model.
arXiv Detail & Related papers (2026-01-12T13:36:48Z)
- IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation [44.89730606641666]
IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process. It consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
arXiv Detail & Related papers (2026-01-06T14:37:50Z)
- Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations [56.816929931908824]
We pioneer the detection of semantically coordinated manipulations in multimodal data. We propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. Our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM than state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-16T04:18:48Z)
- Less is More: Empowering GUI Agent with Context-Aware Simplification [62.02157661751793]
We propose SimpAgent, a context-aware framework for building an efficient and effective GUI agent. With these components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance.
arXiv Detail & Related papers (2025-07-04T17:37:15Z)
- MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation [15.644911934279309]
Diffusion models have shown excellent performance in text-to-image generation. We propose a Multi-Agent Collaboration-based Compositional Diffusion approach for text-to-image generation in complex scenes.
arXiv Detail & Related papers (2025-05-05T13:50:03Z)
- Marmot: Object-Level Self-Correction via Multi-Agent Reasoning [55.74860093731475]
Marmot is a novel and generalizable framework that leverages multi-agent reasoning for multi-object self-correction. Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships for image generation tasks.
arXiv Detail & Related papers (2025-04-10T16:54:28Z)
- Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-Fine Consistency Constraints visual grounding architecture that integrates implicit and explicit modeling approaches within a two-stage framework. It outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
- Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. We propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs. Experiments on incremental tasks (cardiac, fundus, and prostate segmentation) show its clear advantage in alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z)
- UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders. We first develop an adaptive feature mask generator to account for the unique significance of nodes, then design a ranking-based structure reconstruction objective, joint with feature reconstruction, to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z)
- Self-Supervised Scene De-occlusion [186.89979151728636]
This paper investigates scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects. We make the first attempt to address the problem through a novel and unified framework that recovers hidden scene structures without ordering or amodal annotations as supervision. Based on PCNet-M and PCNet-C, we devise a novel inference scheme that accomplishes scene de-occlusion via progressive ordering recovery, amodal completion, and content completion.
arXiv Detail & Related papers (2020-04-06T16:31:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site makes no guarantee of the quality of the listed information and is not responsible for any consequences of its use.