Text-to-Image Generation Grounded by Fine-Grained User Attention
- URL: http://arxiv.org/abs/2011.03775v2
- Date: Tue, 30 Mar 2021 19:52:34 GMT
- Title: Text-to-Image Generation Grounded by Fine-Grained User Attention
- Authors: Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang
- Abstract summary: Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
- Score: 62.94737811887098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localized Narratives is a dataset with detailed natural language descriptions
of images paired with mouse traces that provide a sparse, fine-grained visual
grounding for phrases. We propose TReCS, a sequential model that exploits this
grounding to generate images. TReCS uses descriptions to retrieve segmentation
masks and predict object labels aligned with mouse traces. These alignments are
used to select and position masks to generate a fully covered segmentation
canvas; the final image is produced by a segmentation-to-image generator using
this canvas. This multi-step, retrieval-based approach outperforms existing
direct text-to-image generation models on both automatic metrics and human
evaluations: overall, its generated images are more photo-realistic and better
match descriptions.
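
As a rough illustration of the multi-step pipeline described in the abstract, the sketch below walks through the same stages in Python: tagging grounded phrases with object labels, retrieving a mask per label, placing masks on a canvas at the trace locations, and handing the canvas to a segmentation-to-image generator. All helper functions here (predict_labels, retrieve_mask, compose_canvas, generate_image) are hypothetical stand-ins, not the authors' implementation.

    from dataclasses import dataclass
    from typing import List, Tuple

    import numpy as np

    Trace = List[Tuple[float, float]]  # mouse-trace points, normalized to [0, 1]


    @dataclass
    class GroundedPhrase:
        text: str        # phrase from the narrative, e.g. "a brown dog"
        trace: Trace     # trace segment aligned with the phrase
        label: str = ""  # object label predicted for the phrase


    def predict_labels(phrases: List[GroundedPhrase]) -> List[GroundedPhrase]:
        """Placeholder for the sequence model that tags each phrase with an object class."""
        for p in phrases:
            p.label = p.text.split()[-1]  # naive stand-in: take the last word as the label
        return phrases


    def retrieve_mask(label: str, size: int = 256) -> np.ndarray:
        """Placeholder for retrieving a segmentation mask that matches the label."""
        return np.ones((size // 4, size // 4), dtype=np.uint8)


    def compose_canvas(phrases: List[GroundedPhrase], size: int = 256) -> np.ndarray:
        """Place each retrieved mask on the canvas at its trace's centroid."""
        canvas = np.zeros((size, size), dtype=np.uint8)
        for class_id, p in enumerate(phrases, start=1):
            mask = retrieve_mask(p.label, size)
            h, w = mask.shape
            cx = int(np.mean([x for x, _ in p.trace]) * size)  # trace centroid -> position
            cy = int(np.mean([y for _, y in p.trace]) * size)
            y0, x0 = max(cy - h // 2, 0), max(cx - w // 2, 0)
            region = canvas[y0:y0 + h, x0:x0 + w]
            region[mask[:region.shape[0], :region.shape[1]] > 0] = class_id
        return canvas


    def generate_image(canvas: np.ndarray) -> np.ndarray:
        """Placeholder for the segmentation-to-image generator (e.g. a GAN)."""
        return np.repeat(canvas[..., None], 3, axis=-1)


    phrases = [GroundedPhrase("a brown dog", [(0.3, 0.6), (0.35, 0.65)]),
               GroundedPhrase("a green field", [(0.5, 0.8), (0.6, 0.85)])]
    image = generate_image(compose_canvas(predict_labels(phrases)))
    print(image.shape)  # (256, 256, 3)
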
Related papers
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
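
A minimal, hypothetical sketch of how the three TAS components could be composed; every function here (propose_masks, matching_score, spatial_rectify) is a placeholder standing in for the paper's modules, not the actual implementation.

    from typing import Dict, List

    import numpy as np


    def propose_masks(image: np.ndarray, n: int = 8) -> List[np.ndarray]:
        """Placeholder for the instance-level mask proposal network."""
        h, w = image.shape[:2]
        return [np.random.rand(h, w) > 0.5 for _ in range(n)]


    def matching_score(image: np.ndarray, mask: np.ndarray, expression: str) -> float:
        """Placeholder for the text-augmented visual-text matching score
        (e.g. similarity between a masked-image embedding and the text embedding)."""
        return float(np.random.rand())


    def spatial_rectify(scores: Dict[int, float], masks: List[np.ndarray], expression: str) -> Dict[int, float]:
        """Placeholder post-processing: re-weight proposals whose position matches
        spatial words in the expression (here, only 'left' is handled)."""
        if "left" in expression:
            for i, m in enumerate(masks):
                cols = np.where(m)[1]
                if cols.size and cols.mean() < m.shape[1] / 2:
                    scores[i] *= 1.2  # boost masks lying in the left half of the image
        return scores


    def segment(image: np.ndarray, expression: str) -> np.ndarray:
        masks = propose_masks(image)
        scores = {i: matching_score(image, m, expression) for i, m in enumerate(masks)}
        scores = spatial_rectify(scores, masks, expression)
        return masks[max(scores, key=scores.get)]  # highest-scoring proposal wins


    mask = segment(np.zeros((64, 64, 3)), "the dog on the left")
    print(mask.shape)  # (64, 64)
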
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
- Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation [29.274362919954218]
We propose a new paradigm to automatically generate training data with accurate labels at scale.
The proposed approach decouples training data generation into foreground object generation and contextually coherent background generation.
We demonstrate the advantages of our approach on five object detection and segmentation datasets.
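
As a hedged illustration of the decoupled foreground/background idea, the sketch below composes one synthetic detection example; the paste-based compositing and all helper functions are assumptions for illustration, not the paper's actual procedure.

    from typing import Tuple

    import numpy as np


    def generate_foreground(category: str, size: int = 64) -> Tuple[np.ndarray, np.ndarray]:
        """Placeholder for a text-to-image model producing an object crop plus its mask."""
        return np.random.rand(size, size, 3), np.ones((size, size), dtype=bool)


    def generate_background(context: str, size: int = 256) -> np.ndarray:
        """Placeholder for a text-to-image model producing a contextually coherent background."""
        return np.random.rand(size, size, 3)


    def compose_example(category: str, context: str):
        """Paste the generated object onto the background; the paste location yields the box label for free."""
        bg = generate_background(context)
        obj, mask = generate_foreground(category)
        h, w = obj.shape[:2]
        y = np.random.randint(0, bg.shape[0] - h)
        x = np.random.randint(0, bg.shape[1] - w)
        region = bg[y:y + h, x:x + w]
        region[mask] = obj[mask]                  # composite foreground over background
        return bg, (x, y, x + w, y + h), category  # image, bounding box, class label


    image, box, label = compose_example("dog", "a sunny park")
    print(box, label)
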
arXiv Detail & Related papers (2023-09-12T04:41:45Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
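
A hypothetical sketch of zero-shot layout guidance during sampling: at each denoising step, the latent is nudged so that each phrase's attention mass stays inside its assigned segment. The model calls and the attention surrogate are placeholders, not the ZestGuide implementation.

    import numpy as np


    def denoise_step(latent: np.ndarray, t: int, prompt: str) -> np.ndarray:
        """Placeholder for one reverse-diffusion step of a pre-trained text-to-image model."""
        return 0.99 * latent


    def attention_map(latent: np.ndarray, phrase: str) -> np.ndarray:
        """Placeholder for the cross-attention map tied to a text phrase (here just |latent|)."""
        return np.abs(latent)


    def guided_sampling(prompt: str, segments: dict, steps: int = 50, lr: float = 0.1) -> np.ndarray:
        """Steer sampling so each phrase's attention stays inside its user-provided segment."""
        latent = np.random.randn(64, 64)
        for t in reversed(range(steps)):
            for phrase, seg in segments.items():
                attn = attention_map(latent, phrase)
                # Gradient of the attention mass falling outside the segment,
                # i.e. of ((1 - seg) * attn).sum(), for this toy attention_map.
                grad = (1.0 - seg) * latent / (attn + 1e-8)
                latent = latent - lr * grad  # nudge the latent to reduce the leakage
            latent = denoise_step(latent, t, prompt)
        return latent


    segment = np.zeros((64, 64))
    segment[:, :32] = 1.0  # constrain "dog" to the left half of the canvas
    latent = guided_sampling("a dog in a park", {"dog": segment})
    print(latent.shape)  # (64, 64)
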
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework.
It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling.
It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
- DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection [18.276823176045525]
We propose a new paradigm for automatic context image generation at scale.
At the core of our approach is the interplay between language descriptions of context and language-driven image generation.
We demonstrate the advantages of our approach over the prior context image generation approaches on four object detection datasets.
arXiv Detail & Related papers (2022-06-20T06:43:17Z)
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires users to also express where in the image the content they are looking for is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
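
A speculative sketch of how a "what + where" query could be scored against an image gallery; the encoders, the occupancy-grid trace representation, and the concatenation scheme are assumptions for illustration, not the paper's model.

    from typing import List, Tuple

    import numpy as np

    Trace = List[Tuple[float, float]]


    def encode_text(description: str, dim: int = 32) -> np.ndarray:
        """Placeholder text encoder (the "what")."""
        rng = np.random.default_rng(abs(hash(description)) % (2 ** 32))
        return rng.standard_normal(dim)


    def encode_trace(trace: Trace, dim: int = 32) -> np.ndarray:
        """Placeholder trace encoder (the "where"): a flattened 4x8 occupancy grid."""
        grid = np.zeros((4, 8))
        for x, y in trace:
            grid[min(int(y * 4), 3), min(int(x * 8), 7)] = 1.0
        return grid.reshape(dim)


    def score(query_text: str, query_trace: Trace, image_embedding: np.ndarray) -> float:
        """Combine the two query embeddings and score an image by cosine similarity."""
        q = np.concatenate([encode_text(query_text), encode_trace(query_trace)])
        denom = np.linalg.norm(q) * np.linalg.norm(image_embedding) + 1e-8
        return float(q @ image_embedding / denom)


    # Rank a toy gallery of pre-computed (placeholder) image embeddings.
    gallery = [np.random.randn(64) for _ in range(5)]
    trace = [(0.2, 0.3), (0.25, 0.35), (0.3, 0.4)]
    ranking = sorted(range(len(gallery)), key=lambda i: -score("a dog on the left", trace, gallery[i]))
    print(ranking)
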
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
- Controllable Image Synthesis via SegVAE [89.04391680233493]
A semantic map is a commonly used intermediate representation for conditional image generation.
In this work, we specifically target generating semantic maps given a label set consisting of the desired categories.
The proposed framework, SegVAE, synthesizes semantic maps in an iterative manner using a conditional variational autoencoder.
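
A minimal, hypothetical sketch of iterative semantic-map synthesis from a label set, with a placeholder decoder standing in for the conditional VAE; it is not the SegVAE implementation.

    from typing import List

    import numpy as np


    def decode_mask(label: str, canvas: np.ndarray, z: np.ndarray) -> np.ndarray:
        """Placeholder conditional decoder: a mask for `label` given the current canvas and a latent z."""
        h, w = canvas.shape
        cy, cx = int(abs(z[0]) * h) % h, int(abs(z[1]) * w) % w
        mask = np.zeros((h, w), dtype=bool)
        mask[max(cy - h // 8, 0):cy + h // 8, max(cx - w // 8, 0):cx + w // 8] = True
        return mask


    def synthesize_semantic_map(label_set: List[str], size: int = 128) -> np.ndarray:
        """Add one category at a time, conditioning each mask on the canvas built so far."""
        canvas = np.zeros((size, size), dtype=np.int64)          # 0 = background
        for class_id, label in enumerate(label_set, start=1):
            z = np.random.randn(2)                               # latent sampled from the prior
            mask = decode_mask(label, canvas, z)
            canvas[mask & (canvas == 0)] = class_id              # keep earlier categories in front
        return canvas


    semantic_map = synthesize_semantic_map(["sky", "tree", "dog"])
    print(np.unique(semantic_map))
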
arXiv Detail & Related papers (2020-07-16T15:18:53Z)