Grounded and Controllable Image Completion by Incorporating Lexical
Semantics
- URL: http://arxiv.org/abs/2003.00303v1
- Date: Sat, 29 Feb 2020 16:54:21 GMT
- Title: Grounded and Controllable Image Completion by Incorporating Lexical
Semantics
- Authors: Shengyu Zhang, Tan Jiang, Qinghao Huang, Ziqi Tan, Zhou Zhao, Siliang
Tang, Jin Yu, Hongxia Yang, Yi Yang, and Fei Wu
- Abstract summary: Lexical Semantic Image Completion (LSIC) has potential applications in art, design, and heritage conservation.
We advocate generating results faithful to both visual and lexical semantic context.
One major challenge for LSIC comes from modeling and aligning the structure of visual-semantic context.
- Score: 111.47374576372813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Lexical Semantic Image Completion
(LSIC), an approach with potential applications in art, design, and heritage
conservation, among others. Existing image completion procedures are highly
subjective: by considering only the visual context, they may produce
unpredictable results that are plausible but not faithful to grounded
knowledge. To permit a completion process that is both grounded and
controllable, we advocate generating results faithful to both the visual and
the lexical semantic context, i.e., a textual description of the holes or
blank regions in the image (the hole description). One major challenge for
LSIC lies in modeling and aligning the structure of the visual-semantic
context and translating across the two modalities. We term this process
structure completion, and realize it with multi-grained reasoning blocks in
our model. Another challenge relates to unimodal bias, which occurs when the
model generates plausible results without using the textual description. This
can happen because the annotated captions for an image are often semantically
equivalent in existing datasets, so there is only one paired text for each
masked image during training. We devise an unsupervised unpaired-creation
learning path alongside the well-explored paired-reconstruction path, as well
as a multi-stage training strategy to mitigate the insufficiency of labeled
data. We conduct extensive quantitative and qualitative experiments as well
as ablation studies, which demonstrate the efficacy of the proposed LSIC.
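To make the abstract's "multi-grained reasoning blocks" concrete, below is a
minimal PyTorch sketch of a cross-modal block in which region features from a
masked image attend to both word-level and sentence-level features of the hole
description. All names, sizes, and the block layout are illustrative
assumptions, not the authors' published architecture.

```python
# Illustrative sketch only: a cross-modal reasoning block that aligns
# masked-image region features with a hole description at two granularities.
import torch
import torch.nn as nn

class CrossModalReasoningBlock(nn.Module):
    """Regions attend to word-level (fine) and sentence-level (coarse) text."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.word_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sent_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, regions, words, sentence):
        # Fine grain: each image region queries individual word embeddings.
        fine, _ = self.word_attn(regions, words, words)
        regions = self.norm1(regions + fine)
        # Coarse grain: regions also query one pooled sentence embedding.
        coarse, _ = self.sent_attn(regions, sentence, sentence)
        regions = self.norm2(regions + coarse)
        return regions + self.ffn(regions)

# Usage: 64 region tokens from a masked image, a 12-word hole description.
block = CrossModalReasoningBlock()
regions = torch.randn(2, 64, 256)           # masked-image region features
words = torch.randn(2, 12, 256)             # word-level text features
sentence = words.mean(dim=1, keepdim=True)  # pooled sentence-level feature
out = block(regions, words, sentence)       # -> (2, 64, 256)
```

In a full model, stacks of such blocks would sit inside an encoder-decoder
inpainting network; as the abstract describes, the paired-reconstruction path
would supervise it on annotated masked-image/description pairs, while the
unpaired-creation path adds an unsupervised training signal so the model
cannot ignore the text.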
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z)
- Exploring Semantic Consistency in Unpaired Image Translation to Generate Data for Surgical Applications [1.8011391924021904]
This study empirically investigates unpaired image translation methods for generating suitable data in surgical applications.
We find that a simple combination of a structural-similarity loss and contrastive learning yields the most promising results; a generic sketch of such a combined objective appears after this list.
arXiv Detail & Related papers (2023-09-06T14:43:22Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Unpaired Translation from Semantic Label Maps to Images by Leveraging Domain-Specific Simulations [11.638139969660266]
We introduce a contrastive learning framework for generating photorealistic images from simulated label maps.
Our proposed method is shown to generate realistic and scene-accurate translations.
arXiv Detail & Related papers (2023-02-21T14:36:18Z)
- More Control for Free! Image Synthesis with Semantic Diffusion Guidance [79.88929906247695]
Controllable image synthesis models allow the creation of diverse images based on text instructions or guidance from an example image.
We introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis.
arXiv Detail & Related papers (2021-12-10T18:55:50Z)
- USIS: Unsupervised Semantic Image Synthesis [9.613134538472801]
We propose a new Unsupervised paradigm for Semantic Image Synthesis (USIS).
USIS learns to output images with visually separable semantic classes using a self-supervised segmentation loss.
In order to match the color and texture distribution of real images without losing high-frequency information, we propose to use whole image wavelet-based discrimination.
arXiv Detail & Related papers (2021-09-29T20:48:41Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
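As a concrete illustration of the loss combination mentioned in the surgical
applications entry above, below is a generic sketch that pairs a
structural-similarity (SSIM) term with an InfoNCE-style contrastive term. The
windowless SSIM simplification, the 0.5 weight, and all tensor shapes are
hypothetical assumptions; this is not the cited paper's implementation.

```python
# Generic sketch: structural-similarity loss plus a contrastive objective.
import torch
import torch.nn.functional as F

def ssim_loss(x, y, c1=0.01**2, c2=0.03**2):
    """1 - SSIM over global image statistics (simplified: no local windows)."""
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x, var_y = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mu_x.view(-1, 1, 1, 1)) *
           (y - mu_y.view(-1, 1, 1, 1))).mean(dim=(1, 2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return (1 - ssim).mean()

def infonce_loss(q, k, tau=0.07):
    """Patch features q should match their positives k (diagonal pairs)."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau               # (N, N) similarity matrix
    targets = torch.arange(q.size(0))      # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Combined objective on a translated batch: SSIM keeps global structure,
# the contrastive term enforces patch-wise semantic correspondence.
src = torch.rand(4, 3, 64, 64)                       # source-domain images
fake = torch.rand(4, 3, 64, 64, requires_grad=True)  # translated images
f_src = torch.randn(16, 128)                         # source patch features
f_fake = torch.randn(16, 128, requires_grad=True)    # translated patch features
loss = ssim_loss(fake, src) + 0.5 * infonce_loss(f_fake, f_src)
loss.backward()
```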
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.