Person-in-Context Synthesis with Compositional Structural Space
- URL: http://arxiv.org/abs/2008.12679v1
- Date: Fri, 28 Aug 2020 14:33:28 GMT
- Title: Person-in-Context Synthesis with Compositional Structural Space
- Authors: Weidong Yin, Ziwei Liu, Leonid Sigal
- Abstract summary: We propose a new problem, Persons in Context Synthesis, which aims to synthesize diverse person instance(s) in consistent contexts.
The context is specified by a bounding-box object layout, which lacks shape information, while the pose of the person(s) is specified by sparsely annotated keypoints.
To handle the stark difference in input structures, we propose two separate neural branches to attentively composite the respective (context/person) inputs into a shared "compositional structural space".
This structural space is then decoded to the image space using a multi-level feature modulation strategy, and learned in a self-supervised manner.
- Score: 59.129960774988284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant progress, controlled generation of complex images with
interacting people remains difficult. Existing layout generation methods fall
short of synthesizing realistic person instances, while pose-guided generation
approaches focus on a single person and assume simple or known backgrounds. To
tackle these limitations, we propose a new problem, Persons in Context
Synthesis, which aims to synthesize diverse person instance(s) in consistent
contexts, with user control over both. The context is specified by a bounding-box
object layout, which lacks shape information, while the pose of the person(s) is
specified by sparsely annotated keypoints. To handle the stark difference in input
structures, we propose two separate neural branches to attentively composite
the respective (context/person) inputs into a shared "compositional structural
space", which encodes shape, location and appearance information for both
context and person structures in a disentangled manner. This structural space
is then decoded to the image space using a multi-level feature modulation
strategy, and learned in a self-supervised manner from image collections and
their corresponding inputs. Extensive experiments on two large-scale datasets
(COCO-Stuff and Visual Genome) demonstrate that our framework outperforms
state-of-the-art methods with respect to synthesis quality.
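The abstract does not spell out the "multi-level feature modulation" decoder, but SPADE-style modulation is the standard form of this idea: at each decoder level, a per-element scale (gamma) and shift (beta) are predicted from the structural input and applied to normalized features. A minimal pure-Python sketch of that mechanism, assuming toy 1-D feature lists and externally supplied (gamma, beta) in place of a learned prediction network (all names here are illustrative, not the authors' implementation):

```python
import math

def normalize(features):
    """Zero-mean, unit-variance normalization over a 1-D feature list."""
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    std = math.sqrt(var + 1e-5)
    return [(f - mean) / std for f in features]

def modulate(features, gamma, beta):
    """Apply element-wise scale (gamma) and shift (beta) to normalized
    features -- the core operation of feature modulation.  In a real
    model, gamma and beta would be predicted from the structural input."""
    normed = normalize(features)
    return [g * x + b for x, g, b in zip(normed, gamma, beta)]

def multi_level_decode(structural_code, levels):
    """Toy multi-level decoder: starting from the structural code, each
    level re-modulates the running features with its own (gamma, beta)."""
    features = list(structural_code)
    for gamma, beta in levels:
        features = modulate(features, gamma, beta)
    return features
```

Because normalization discards the features' original scale and offset before each modulation, the structural input (via gamma and beta) controls the feature statistics at every level, which is what lets a spatial layout steer the decoded image.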
Related papers
- LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis [24.925757148750684]
We propose a training-free approach for layout-to-image synthesis that excels in producing high-quality images aligned with both textual prompts and layout instructions.
LoCo seamlessly integrates into existing text-to-image and layout-to-image models, enhancing their performance in spatial control and addressing semantic failures observed in prior methods.
arXiv Detail & Related papers (2023-11-21T04:28:12Z)
- Enhancing Object Coherence in Layout-to-Image Synthesis [13.289854750239956]
We propose a novel diffusion model with effective global semantic fusion (GSF) and self-similarity feature enhancement modules.
For semantic coherence, we argue that the image caption contains rich information for defining the semantic relationship within the objects in the images.
To improve the physical coherence, we develop a Self-similarity Coherence Attention synthesis (SCA) module to explicitly integrate local contextual physical coherence relation into each pixel's generation process.
arXiv Detail & Related papers (2023-11-17T13:43:43Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- SceneComposer: Any-Level Semantic Image Synthesis [80.55876413285587]
We propose a new framework for conditional image synthesis from semantic layouts at any level of precision.
The framework naturally reduces to text-to-image (T2I) at the lowest level, with no shape information, and becomes segmentation-to-image (S2I) at the highest level.
We introduce several novel techniques to address the challenges coming with this new setup.
arXiv Detail & Related papers (2022-11-21T18:59:05Z)
- Layout-Bridging Text-to-Image Synthesis [20.261873143881573]
We push for effective modeling in both text-to-image generation and layout-to-image synthesis.
We focus on learning the textual-visual semantic alignment per object in the layout to precisely incorporate the input text into the layout-to-image synthesis process.
arXiv Detail & Related papers (2022-08-12T08:21:42Z)
- StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis [52.341186561026724]
Lacking compositionality could have severe implications for robustness and fairness.
We introduce a new framework, StyleT2I, to improve the compositionality of text-to-image synthesis.
Results show that StyleT2I outperforms previous approaches in terms of consistency between the input text and synthesized images.
arXiv Detail & Related papers (2022-03-29T17:59:50Z)
- Interactive Image Synthesis with Panoptic Layout Generation [14.1026819862002]
We propose Panoptic Layout Generative Adversarial Networks (PLGAN) to address this challenge.
PLGAN employs panoptic theory which distinguishes object categories between "stuff" with amorphous boundaries and "things" with well-defined shapes.
We experimentally compare our PLGAN with state-of-the-art layout-based models on the COCO-Stuff, Visual Genome, and Landscape datasets.
arXiv Detail & Related papers (2022-03-04T02:45:27Z)
- Content-aware Warping for View Synthesis [110.54435867693203]
We propose content-aware warping, which adaptively learns the weights for pixels of a relatively large neighborhood from their contextual information via a lightweight neural network.
Based on this learnable warping module, we propose a new end-to-end learning-based framework for novel view synthesis from two source views.
Experimental results on structured light field datasets with wide baselines and unstructured multi-view datasets show that the proposed method significantly outperforms state-of-the-art methods both quantitatively and visually.
arXiv Detail & Related papers (2022-01-22T11:35:05Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual stories.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.