Detailed Human-Centric Text Description-Driven Large Scene Synthesis
- URL: http://arxiv.org/abs/2311.18654v1
- Date: Thu, 30 Nov 2023 16:04:30 GMT
- Title: Detailed Human-Centric Text Description-Driven Large Scene Synthesis
- Authors: Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun
- Abstract summary: DetText2Scene is a novel text-driven large-scale image synthesis with high faithfulness, controllability, and naturalness.
Our DetText2Scene significantly outperforms prior arts in text-to-large synthesis qualitatively and quantitatively.
- Score: 14.435565761166648
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text-driven large scene image synthesis has made significant progress with
diffusion models, but controlling it is challenging. While using additional
spatial controls with corresponding texts has improved the controllability of
large scene synthesis, it is still challenging to faithfully reflect detailed
text descriptions without user-provided controls. Here, we propose
DetText2Scene, a novel text-driven large-scale image synthesis method with high
faithfulness, controllability, and naturalness in a global context for
detailed human-centric text descriptions. Our DetText2Scene consists of 1)
hierarchical keypoint-box layout generation from the detailed description by
leveraging a large language model (LLM), 2) view-wise conditioned joint diffusion
process to synthesize a large scene from the given detailed text with the
LLM-generated grounded keypoint-box layout, and 3) pixel perturbation-based
pyramidal interpolation to progressively refine the large scene for global
coherence. Our DetText2Scene significantly outperforms prior art in
text-to-large scene synthesis qualitatively and quantitatively, demonstrating
strong faithfulness with detailed descriptions, superior controllability, and
excellent naturalness in a global context.
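To make the view-wise conditioned joint diffusion step (component 2 above) more concrete, below is a minimal sketch of how overlapping views of a large latent canvas could be denoised separately and fused back together. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function name joint_denoise_step, the denoise_view callable, the fixed view size, and the uniform averaging in overlap regions are all assumptions.

```python
import torch

def joint_denoise_step(canvas, view_corners, denoise_view, view_size=64):
    """Sketch: fuse per-view denoising predictions over a large latent canvas.

    canvas:        (C, H, W) latent tensor for the whole large scene.
    view_corners:  list of (y0, x0) top-left corners of overlapping crops,
                   assumed to lie fully inside the canvas.
    denoise_view:  stand-in for a diffusion model conditioned on the detailed
                   text and on the LLM-generated keypoint-box layout elements
                   that fall inside the given view.
    """
    C, H, W = canvas.shape
    h = w = view_size
    acc = torch.zeros_like(canvas)                # summed per-view predictions
    cover = torch.zeros(1, H, W, device=canvas.device, dtype=canvas.dtype)  # view coverage per pixel
    for (y0, x0) in view_corners:
        crop = canvas[:, y0:y0 + h, x0:x0 + w]
        pred = denoise_view(crop, (y0, x0))       # per-view conditioned estimate
        acc[:, y0:y0 + h, x0:x0 + w] += pred
        cover[:, y0:y0 + h, x0:x0 + w] += 1.0
    # Average where views overlap; pixels not covered by any view keep the original latent.
    return torch.where(cover > 0, acc / cover.clamp(min=1.0), canvas)
```

In the full pipeline described by the abstract, such a fused step would be iterated over the diffusion timesteps, and the pyramidal interpolation stage (component 3) would repeat the process at progressively higher resolutions with small pixel perturbations to improve global coherence.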
Related papers
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
- Text2Grasp: Grasp synthesis by text prompts of object grasping parts [4.031699584957737]
The hand plays a pivotal role in the human ability to grasp and manipulate objects.
Existing methods that use human intention or task-level language as control signals for grasping inherently face ambiguity.
We propose a grasp synthesis method guided by text prompts of object grasping parts, Text2Grasp, which provides more precise control.
arXiv Detail & Related papers (2024-04-09T10:57:27Z)
- Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models [63.99110667987318]
We present DiffText, a pipeline that seamlessly blends foreground text with the background's intrinsic features.
With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors.
arXiv Detail & Related papers (2023-11-28T06:51:28Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences that supports precise text descriptions.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference.
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Person-in-Context Synthesis with Compositional Structural Space [59.129960774988284]
We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts.
The context is specified by a bounding-box object layout, which lacks shape information, while the pose of the person(s) is specified by sparsely annotated keypoints.
To handle the stark difference in input structures, we propose two separate neural branches that attentively composite the respective (context/person) inputs into a shared compositional structural space.
This structural space is then decoded to the image space using a multi-level feature modulation strategy, and is learned in a self-supervised manner.
arXiv Detail & Related papers (2020-08-28T14:33:28Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
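For the DF-GAN entry above, here is a minimal sketch of the matching-aware gradient penalty idea: penalize the discriminator's gradient at (real image, matching text) pairs so that its decision surface is smooth around the true data points. The discriminator interface, the weight, and the exponent below are illustrative assumptions, not the paper's exact settings.

```python
import torch

def matching_aware_gradient_penalty(discriminator, real_images, text_embeds,
                                    weight=2.0, power=6):
    """Sketch of a gradient penalty on (real image, matching text) pairs.

    discriminator: callable(images, text_embeds) -> per-sample scores
                   (a single "one-way" output rather than separate heads).
    weight, power: assumed hyperparameters for illustration only.
    """
    real_images = real_images.detach().requires_grad_(True)
    text_embeds = text_embeds.detach().requires_grad_(True)
    scores = discriminator(real_images, text_embeds)
    # Gradients of the scores w.r.t. both the image and the text inputs.
    grads = torch.autograd.grad(outputs=scores.sum(),
                                inputs=[real_images, text_embeds],
                                create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return weight * (grad_norm ** power).mean()
```

This term would be added to the discriminator loss on real matching pairs, alongside the usual adversarial terms for fake and mismatched pairs.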
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.