ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
- URL: http://arxiv.org/abs/2403.10004v1
- Date: Fri, 15 Mar 2024 04:02:31 GMT
- Title: ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images
- Authors: Xiangtian Xue, Jiasong Wu, Youyong Kong, Lotfi Senhadji, Huazhong Shu
- Abstract summary: We present a novel image editing scenario termed Text-grounded Object Generation (TOG).
We propose a universal framework, ST-LDM, based on the Swin Transformer.
Our model enhances the localization of attention mechanisms while preserving the generative capabilities inherent to diffusion models.
- Score: 9.906943507715779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel image editing scenario termed Text-grounded Object Generation (TOG), defined as generating a new object in a real image spatially conditioned on textual descriptions. Existing diffusion models exhibit limited spatial perception in complex real-world scenes and rely on additional modalities to enforce constraints, while TOG imposes heightened challenges on scene comprehension under the weak supervision of linguistic information. We propose a universal framework, ST-LDM, based on the Swin Transformer, which can be integrated into any latent diffusion model with training-free backward guidance. ST-LDM comprises a global-perceptual autoencoder with adaptable compression scales and hierarchical visual features, in parallel with a deformable multimodal transformer that generates region-wise guidance for the subsequent denoising process. We transcend the limitation of traditional attention mechanisms, which attend only to existing visual features, by introducing deformable feature alignment that hierarchically refines spatial positioning fused with multi-scale visual and linguistic information. Extensive experiments demonstrate that our model enhances the localization of attention mechanisms while preserving the generative capabilities inherent to diffusion models.
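The abstract does not come with reference code, but the general idea of training-free backward guidance can be illustrated with a minimal sketch: at each denoising step, an energy measuring how much of the grounded object's cross-attention falls inside the region proposed by the multimodal transformer is differentiated with respect to the latent, and the latent is nudged along the negative gradient before the ordinary update. Everything below is an assumption for illustration rather than ST-LDM's actual implementation: the `unet` is assumed to also return the object token's cross-attention map, the `scheduler` is assumed to follow a diffusers-style `step()` interface, and `region_energy` and `guidance_scale` are hypothetical names.

```python
import torch

def region_energy(cross_attn, region_mask):
    # Hypothetical energy: fraction of the object token's attention mass
    # that falls outside the target region (lower is better).
    inside = (cross_attn * region_mask).sum()
    return 1.0 - inside / (cross_attn.sum() + 1e-8)

@torch.no_grad()
def denoise_with_backward_guidance(z, scheduler, unet, text_emb,
                                   region_mask, guidance_scale=30.0):
    """Minimal sketch of training-free backward guidance wrapped around a
    standard latent-diffusion sampling loop (interfaces assumed, see above)."""
    for t in scheduler.timesteps:
        # Backward guidance: differentiate the region energy w.r.t. the latent.
        with torch.enable_grad():
            z_in = z.detach().requires_grad_(True)
            _, cross_attn = unet(z_in, t, text_emb)
            grad = torch.autograd.grad(region_energy(cross_attn, region_mask), z_in)[0]
        z = z - guidance_scale * grad          # nudge latent toward the target region
        noise_pred, _ = unet(z, t, text_emb)   # ordinary denoising prediction
        z = scheduler.step(noise_pred, t, z).prev_sample
    return z
```

Because the correction only touches the latent at inference time, such a loop can wrap a pretrained latent diffusion model without retraining, which is the sense in which the abstract describes the guidance as training-free.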
Related papers
- Zero-shot Text-guided Infinite Image Synthesis with LLM guidance [2.531998650341267]
There is a lack of text-image paired datasets with high resolution and contextual diversity.
Expanding images based on text requires global coherence and rich local context understanding.
We propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding.
arXiv Detail & Related papers (2024-07-17T15:10:01Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (RGPT for short) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied across a range of region-level tasks, significantly enhancing performance.
arXiv Detail & Related papers (2024-03-04T18:58:08Z)
- Image Translation as Diffusion Visual Programmers [52.09889190442439]
Diffusion Visual Programmer (DVP) is a neuro-symbolic image translation framework.
Our framework seamlessly embeds a condition-flexible diffusion model within the GPT architecture.
Extensive experiments demonstrate DVP's remarkable performance, surpassing concurrent approaches.
arXiv Detail & Related papers (2024-01-18T05:50:09Z)
- One-for-All: Towards Universal Domain Translation with a Single StyleGAN [86.33216867136639]
We propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains.
The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations.
UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
arXiv Detail & Related papers (2023-10-22T08:02:55Z)
- Light Field Diffusion for Single-View Novel View Synthesis [32.59286750410843]
Single-view novel view synthesis (NVS) is important but challenging in computer vision.
Recent advancements in NVS have leveraged Denoising Diffusion Probabilistic Models (DDPMs) for their exceptional ability to produce high-fidelity images.
We present Light Field Diffusion (LFD), a novel conditional diffusion-based approach that transcends the conventional reliance on camera pose matrices.
arXiv Detail & Related papers (2023-09-20T03:27:06Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields state-of-the-art generative performance, especially in producing coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator [29.58245990622227]
Multimodal-driven talking face generation refers to animating a portrait with the given pose, expression, and gaze transferred from the driving image and video, or estimated from the text and audio.
Existing methods ignore the potential of the text modality, and their generators mainly follow a source-oriented feature paradigm coupled with unstable GAN frameworks.
We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes.
arXiv Detail & Related papers (2023-05-04T07:01:36Z)
- FER-former: Multi-modal Transformer for Facial Expression Recognition [14.219492977523682]
This paper proposes FER-former, a novel multifarious supervision-steering Transformer for Facial Expression Recognition.
Our approach features multi-granularity embedding integration, hybrid self-attention scheme, and heterogeneous domain-steering supervision.
Experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over existing state-of-the-art methods.
arXiv Detail & Related papers (2023-03-23T02:29:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.