MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2402.05408v2
- Date: Tue, 27 Feb 2024 08:04:11 GMT
- Title: MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
- Authors: Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, Yi Yang
- Abstract summary: We present a Multi-Instance Generation (MIG) task, simultaneously generating multiple instances with diverse controls in one image.
We introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task.
To evaluate how well generation models perform on the MIG task, we provide a COCO-MIG benchmark along with an evaluation pipeline.
- Score: 22.27724733876081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a Multi-Instance Generation (MIG) task, simultaneously generating
multiple instances with diverse controls in one image. Given a set of
predefined coordinates and their corresponding descriptions, the task is to
ensure that generated instances are accurately at the designated locations and
that all instances' attributes adhere to their corresponding description. This
broadens the scope of current research on Single-instance generation, elevating
it to a more versatile and practical dimension. Inspired by the idea of divide
and conquer, we introduce an innovative approach named Multi-Instance
Generation Controller (MIGC) to address the challenges of the MIG task.
Initially, we break down the MIG task into several subtasks, each involving the
shading of a single instance. To ensure precise shading for each instance, we
introduce an instance enhancement attention mechanism. Lastly, we aggregate all
the shaded instances to provide the necessary information for accurately
generating multiple instances in stable diffusion (SD). To evaluate how well
generation models perform on the MIG task, we provide a COCO-MIG benchmark
along with an evaluation pipeline. Extensive experiments were conducted on the
proposed COCO-MIG benchmark, as well as on various commonly used benchmarks.
The evaluation results illustrate the exceptional control capabilities of our
model in terms of quantity, position, attribute, and interaction. Code and
demos will be released at https://migcproject.github.io/.
Related papers
- OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks [77.19223035769248]
Recent breakthroughs in large multimodal models (LMMs) have demonstrated remarkable proficiency in following general-purpose instructions for image generation.<n>We introduce OmniGenBench, a novel benchmark meticulously designed to assess the instruction-following abilities of state-of-the-art LMMs.<n>Our OmniGenBench includes 57 diverse sub-tasks grounded in real-world scenarios, systematically categorized according to the specific model capabilities they demand.
arXiv Detail & Related papers (2025-05-24T16:29:34Z) - IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval [29.05476868272228]
Instance-Driven Multimodal Image Retrieval (IDMR) is a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario.
To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data.
Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench.
arXiv Detail & Related papers (2025-04-01T16:47:20Z) - MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing [12.491684385808902]
MMO-IG is designed to generate RS images with supervised object labels from global and local aspects simultaneously.
Considering the complex interdependencies among MMOs, we construct a spatial-cross dependency knowledge graph.
Our MMO-IG exhibits superior generation capabilities for RS images with dense MMO-supervised labels.
arXiv Detail & Related papers (2024-12-18T10:19:12Z) - MIGC++: Advanced Multi-Instance Generation Controller for Image Synthesis [33.52454028815209]
We introduce the Multi-Instance Generation (MIG) task, which focuses on generating multiple instances within a single image.
MIG faces three main challenges: avoiding attribute leakage between instances, supporting diverse instance descriptions, and maintaining consistency in iterative generation.
We introduce the COCO-MIG and Multimodal-MIG benchmarks to evaluate these methods.
arXiv Detail & Related papers (2024-07-02T14:59:37Z) - Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z) - RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything [117.02741621686677]
This work explores a novel real-time segmentation setting called real-time multi-purpose segmentation.
It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation.
We present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM)
It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding.
arXiv Detail & Related papers (2024-01-18T18:59:30Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot
Fine-grained Action Recognition [80.21796574234287]
M$3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates textitmulti-view encoding, textitmulti-view matching, and textitmulti-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - Generalizable Metric Network for Cross-domain Person Re-identification [55.71632958027289]
Cross-domain (i.e., domain generalization) scene presents a challenge in Re-ID tasks.
Most existing methods aim to learn domain-invariant or robust features for all domains.
We propose a Generalizable Metric Network (GMN) to explore sample similarity in the sample-pair space.
arXiv Detail & Related papers (2023-06-21T03:05:25Z) - PointTAD: Multi-Label Temporal Action Detection with Learnable Query
Points [28.607690605262878]
temporal action detection (TAD) usually handles untrimmed videos with small number of action instances from a single label.
In this paper, we focus on the task of multi-label temporal action detection that aims to localize all action instances from a multi-label untrimmed video.
We extend the sparse query-based detection paradigm from the traditional TAD and propose the multi-label TAD framework of PointTAD.
arXiv Detail & Related papers (2022-10-20T06:08:03Z) - MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z) - Diverse Instance Discovery: Vision-Transformer for Instance-Aware
Multi-Label Image Recognition [24.406654146411682]
Vision Transformer (ViT) is the research base for this paper.
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
arXiv Detail & Related papers (2022-04-22T14:38:40Z) - Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object (VOS)
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST)
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z) - UniT: Unified Knowledge Transfer for Any-shot Object Detection and
Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.