Composition and Deformance: Measuring Imageability with a Text-to-Image Model
- URL: http://arxiv.org/abs/2306.03168v1
- Date: Mon, 5 Jun 2023 18:22:23 GMT
- Title: Composition and Deformance: Measuring Imageability with a Text-to-Image Model
- Authors: Si Wu, David A. Smith
- Abstract summary: We propose methods that use generated images to measure the imageability of single English words and connected text.
We find high correlation between the proposed computational measures of imageability and human judgments of individual words.
We discuss possible effects of model training and implications for the study of compositionality in text-to-image models.
- Score: 8.008504325316327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although psycholinguists and psychologists have long studied the tendency of
linguistic strings to evoke mental images in hearers or readers, most
computational studies have applied this concept of imageability only to
isolated words. Using recent developments in text-to-image generation models,
such as DALL·E mini, we propose computational methods that use generated images
to measure the imageability of both single English words and connected text. We
sample text prompts for image generation from three corpora: human-generated
image captions, news article sentences, and poem lines. We subject these
prompts to different deformances to examine the model's ability to detect
changes in imageability caused by compositional change. We find high
correlation between the proposed computational measures of imageability and
human judgments of individual words. We also find the proposed measures more
consistently respond to changes in compositionality than baseline approaches.
We discuss possible effects of model training and implications for the study of
compositionality in text-to-image models.
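The abstract does not spell out the concrete measure, but one plausible way to operationalize it is to generate several images per prompt and score how visually consistent they are, on the intuition that highly imageable text yields more consistent generations. Below is a minimal sketch under that assumption: a Stable Diffusion pipeline stands in for DALL·E mini, and mean pairwise CLIP image similarity serves as the consistency score. The checkpoints, the similarity proxy, the `imageability_score` helper, and the example sentence are all illustrative assumptions, not the paper's published procedure.

```python
# pip install torch diffusers transformers
import random

import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Any diffusers-compatible text-to-image checkpoint can stand in here;
# the paper itself used DALL·E mini.
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def imageability_score(prompt: str, n_images: int = 8) -> float:
    """Proxy for imageability: mean pairwise cosine similarity among CLIP
    embeddings of n_images generated from the same prompt (an assumption,
    not the paper's published metric)."""
    images = pipe(prompt, num_images_per_prompt=n_images).images
    inputs = clip_proc(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    sim = emb @ emb.T                            # pairwise cosine similarities
    off_diag = ~torch.eye(n_images, dtype=torch.bool, device=emb.device)
    return sim[off_diag].mean().item()           # average over distinct image pairs

# A simple word-order "deformance": shuffling a compositional line should
# change the score more than it would for an unordered bag of words.
line = "the red boat drifts under a pale winter moon"
words = line.split()
shuffled = " ".join(random.sample(words, k=len(words)))
print(imageability_score(line), imageability_score(shuffled))
```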
Related papers
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Understanding Subjectivity through the Lens of Motivational Context in Model-Generated Image Satisfaction [21.00784031928471]
Image generation models are poised to become ubiquitous in a range of applications.
These models are often fine-tuned and evaluated using human quality judgments that assume a universal standard.
To investigate how to quantify subjectivity, and the scale of its impact, we measure how assessments differ among human annotators across different use cases.
arXiv Detail & Related papers (2024-02-27T01:16:55Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words.
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
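For readers unfamiliar with the term, here is a minimal sketch of the "is-a" check that a WordNet-based evaluation can build on, using NLTK's WordNet interface. The `is_hypernym` helper is illustrative only and is not that paper's evaluation protocol.

```python
# pip install nltk; then run nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def is_hypernym(general: str, specific: str) -> bool:
    """Return True if some noun sense of `general` is an ancestor ("is-a")
    of some noun sense of `specific` in the WordNet hierarchy."""
    general_synsets = set(wn.synsets(general, pos=wn.NOUN))
    for syn in wn.synsets(specific, pos=wn.NOUN):
        # closure() walks the transitive hypernym chain for this sense.
        ancestors = set(syn.closure(lambda s: s.hypernyms()))
        if general_synsets & ancestors:
            return True
    return False

print(is_hypernym("animal", "dog"))   # True: a dog is an animal
print(is_hypernym("dog", "animal"))   # False: the relation is directed
```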
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Affect-Conditioned Image Generation [0.9668407688201357]
We introduce a method for generating images conditioned on desired affect, quantified using a psychometrically validated three-component approach.
We first train a neural network for estimating the affect content of text and images from semantic embeddings, and then demonstrate how this can be used to exert control over a variety of generative models.
arXiv Detail & Related papers (2023-02-20T03:44:04Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- Cross-Modal Coherence for Text-to-Image Retrieval [35.82045187976062]
We train a Cross-Modal Coherence Model for the text-to-image retrieval task.
Our analysis shows that models trained with image-text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models.
Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.
arXiv Detail & Related papers (2021-09-22T21:31:27Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)