Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet
Hierarchy
- URL: http://arxiv.org/abs/2310.09247v1
- Date: Fri, 13 Oct 2023 16:53:25 GMT
- Title: Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet
Hierarchy
- Authors: Anton Baryshnikov, Max Ryabinin
- Abstract summary: We measure the capability of popular text-to-image models to understand $textithypernymy$, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
- Score: 12.82992353036576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis has recently attracted widespread attention due to
rapidly improving quality and numerous practical applications. However, the
language understanding capabilities of text-to-image models are still poorly
understood, which makes it difficult to reason about prompt formulations that a
given model would understand well. In this work, we measure the capability of
popular text-to-image models to understand $\textit{hypernymy}$, or the "is-a"
relation between words. We design two automatic metrics based on the WordNet
semantic hierarchy and existing image classifiers pretrained on ImageNet. These
metrics both enable broad quantitative comparison of linguistic capabilities
for text-to-image models and offer a way of finding fine-grained qualitative
differences, such as words that are unknown to models and thus are difficult
for them to draw. We comprehensively evaluate popular text-to-image models,
including GLIDE, Latent Diffusion, and Stable Diffusion, showing how our
metrics can provide a better understanding of the individual strengths and
weaknesses of these models.
Related papers
- TypeScore: A Text Fidelity Metric for Text-to-Image Generative Models [39.06617653124486]
We introduce a new evaluation framework called TypeScore to assess a model's ability to generate images with high-fidelity embedded text.
Our proposed metric demonstrates greater resolution than CLIPScore to differentiate popular image generation models.
arXiv Detail & Related papers (2024-11-02T07:56:54Z) - Advanced Multimodal Deep Learning Architecture for Image-Text Matching [33.8315200009152]
Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship.
We introduce an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding.
Experiments show that compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets.
arXiv Detail & Related papers (2024-06-13T08:32:24Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z) - Language Does More Than Describe: On The Lack Of Figurative Speech in
Text-To-Image Models [63.545146807810305]
Text-to-image diffusion models can generate high-quality pictures from textual input prompts.
These models have been trained using text data collected from content-based labelling protocols.
We characterise the sentimentality, objectiveness and degree of abstraction of publicly available text data used to train current text-to-image diffusion models.
arXiv Detail & Related papers (2022-10-19T14:20:05Z) - Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.