AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation
- URL: http://arxiv.org/abs/2406.01388v2
- Date: Tue, 11 Jun 2024 02:29:53 GMT
- Title: AutoStudio: Crafting Consistent Subjects in Multi-turn Interactive Image Generation
- Authors: Junhao Cheng, Xi Lu, Hanhui Li, Khun Loun Zai, Baiqiao Yin, Yuhao Cheng, Yiqiang Yan, Xiaodan Liang
- Abstract summary: We introduce a training-free multi-agent framework called AutoStudio for multi-turn interactive image generation.
AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images.
Experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns.
- Score: 41.990464968024845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, begins to attract the attention of related research communities. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a stable diffusion (SD) based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. Our AutoStudio can hereby generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns, and it also raises the state-of-the-art performance by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
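The abstract describes a concrete agent loop: a subject manager interprets the dialogue, a layout generator proposes bounding boxes, a supervisor refines them, and an SD-based drawer renders the image. Below is a minimal, hypothetical Python sketch of how one such turn could be wired together; the names (Subject, autostudio_turn, call_llm, sd_generate) and the dictionary formats are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch of the AutoStudio turn loop described in the abstract.
# Subject, autostudio_turn, call_llm, and sd_generate are illustrative
# placeholders, not the released implementation.
from dataclasses import dataclass, field

@dataclass
class Subject:
    name: str
    description: str
    history: list = field(default_factory=list)  # prior renderings, kept for consistency

def autostudio_turn(dialogue, subjects, call_llm, sd_generate, max_refinements=2):
    """One interactive turn: interpret -> layout -> supervise -> draw."""
    # (i) Subject manager: parse the dialogue and update each subject's context.
    parsed = call_llm(f"Extract the subjects and the scene request from: {dialogue}")
    for name, desc in parsed["subjects"].items():
        subjects.setdefault(name, Subject(name, desc)).description = desc

    # (ii) Layout generator: one bounding box per subject.
    layout = call_llm(f"Propose a bounding box [x0, y0, x1, y1] per subject for: {parsed['scene']}")

    # (iii) Supervisor: critique the layout and accept or revise it.
    for _ in range(max_refinements):
        review = call_llm(f"Check this layout against the request: {layout}")
        if review.get("ok"):
            break
        layout = review["revised_layout"]

    # (iv) Drawer: SD-based generation conditioned on the layout and subject context.
    image = sd_generate(prompt=parsed["scene"], boxes=layout, subjects=list(subjects.values()))
    for subject in subjects.values():
        subject.history.append(image)  # carry context into later turns
    return image
```

In this sketch the LLM calls stand in for the three LLM-based agents and `sd_generate` for the Parallel-UNet drawer; the per-subject `history` is what would let later turns stay consistent with earlier ones.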
Related papers
- DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control [48.41743234012456]
DisenStudio is a novel framework that can generate text-guided videos for multiple customized subjects.
DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism.
We conduct extensive experiments to demonstrate that our proposed DisenStudio significantly outperforms existing methods in various metrics.
arXiv Detail & Related papers (2024-05-21T13:44:55Z)
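The DisenStudio summary mentions a spatially disentangled cross-attention mechanism. As a rough, hedged illustration of the general idea (each subject's text tokens only influencing image tokens inside that subject's region), here is a region-masked cross-attention sketch; `region_masked_cross_attention` and its mask format are assumptions, not DisenStudio's actual module.

```python
# Hedged sketch of the idea behind spatially disentangled cross-attention:
# text tokens describing a subject only influence image tokens inside that
# subject's region. This is an interpretation of the summary above, not the
# DisenStudio implementation.
import torch
import torch.nn.functional as F

def region_masked_cross_attention(img_tokens, txt_tokens, region_mask):
    """
    img_tokens:  (N, d)  flattened latent/image tokens
    txt_tokens:  (M, d)  text tokens for all subjects
    region_mask: (N, M)  True where image token i may attend to text token j,
                 i.e. token j describes the subject occupying location i
    """
    d = img_tokens.shape[-1]
    scores = img_tokens @ txt_tokens.T / d ** 0.5          # (N, M) attention logits
    scores = scores.masked_fill(~region_mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    attn = torch.nan_to_num(attn)                          # rows with no allowed tokens
    return attn @ txt_tokens                               # (N, d) attended features
```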
- MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion [81.7514869897233]
We develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, can progressively generate multi-object images.
MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with stable diffusion.
MuLan also adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to have the diffusion model re-generate the image if it violates the original prompt.
arXiv Detail & Related papers (2024-02-20T06:14:30Z)
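The MuLan entry above describes a simple control loop: decompose the prompt with an LLM, add one object per sub-task with stable diffusion, and let a VLM veto results that violate the prompt. A hedged sketch of such a loop, with `llm_decompose`, `sd_add_object`, and `vlm_check` as placeholder callables, could look like this:

```python
# Hypothetical sketch of the progressive loop summarized above for MuLan.
# llm_decompose, sd_add_object, and vlm_check are placeholder callables,
# not the paper's released API.
def progressive_generation(prompt, llm_decompose, sd_add_object, vlm_check, max_retries=3):
    canvas = None                                        # start from an empty image
    for sub_task in llm_decompose(prompt):               # e.g. "a red apple on the left"
        for _ in range(max_retries):
            candidate = sd_add_object(canvas, sub_task)  # draw one object onto the canvas
            if vlm_check(candidate, prompt, sub_task):   # consistent with the original prompt?
                canvas = candidate
                break
        else:
            canvas = candidate                           # keep the last attempt if all retries fail
    return canvas
```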
- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
- Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z)
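The Instruct-Imagen entry above centers on *multi-modal instruction* as a task representation. Purely as an illustration (this is not the paper's schema), such an instruction can be thought of as free-form text whose placeholders point at multi-modal conditions such as reference images:

```python
# Illustrative, hypothetical sketch of a multi-modal instruction record;
# the field names and placeholder syntax are assumptions, not the paper's format.
from dataclasses import dataclass

@dataclass
class MultiModalInstruction:
    instruction: str   # natural-language intent with named placeholders
    conditions: dict   # maps placeholder names to images, masks, styles, etc.

example = MultiModalInstruction(
    instruction="Generate an image of [subject] in the style of [style_ref].",
    conditions={"subject": "dog.png", "style_ref": "ukiyo-e.png"},
)
```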
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)