Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models
- URL: http://arxiv.org/abs/2411.17066v1
- Date: Tue, 26 Nov 2024 03:06:52 GMT
- Title: Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models
- Authors: Colin Conwell, Rupert Tawiah-Quashie, Tomer Ullman,
- Abstract summary: We examine three forms of logical operators: relations, negations, and discrete numbers.
None reliably produce human agreement scores greater than 50%.
We conclude by discussing the limitations inherent to grounded' multimodal learning systems.
- Score: 0.5461938536945723
- License:
- Abstract: Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these `logical probes', and find that none reliably produce human agreement scores greater than 50\%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a `grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further clarity on potential sources of success and failure in these text-to-image systems, we supplement our 4 core experiments with multiple auxiliary analyses and schematic diagrams, directly quantifying, for example, the relationship between the N-gram frequency of relational prompts and the average match to generated images; the success rates for 3 different prompt modification strategies in the rendering of negation prompts; and the scalar variability / ratio dependence (`approximate numeracy') of prompts involving integers. We conclude by discussing the limitations inherent to `grounded' multimodal learning systems whose grounding relies heavily on vector-based semantics (e.g. DALL-E 3), or under-specified syntactical constraints (e.g. `grounded diffusion'), and propose minimal modifications (inspired by development, based in imagery) that could help to bridge the lingering compositional gap between scale and structure. All data and code is available at https://github.com/ColinConwell/T2I-Probology
Related papers
- Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph [0.3926357402982764]
We propose a modular approach called BBQ that constructs 3D scene graph representation with metric and semantic edges.
BBQ employs robust DINO-powered associations to construct 3D object-centric map.
We show that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods.
arXiv Detail & Related papers (2024-06-11T09:57:04Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining
Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulating such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z) - Discovering Failure Modes of Text-guided Diffusion Models via
Adversarial Search [52.519433040005126]
Text-guided diffusion models (TDMs) are widely applied but can fail unexpectedly.
In this work, we aim to study and understand the failure modes of TDMs in more detail.
We propose SAGE, the first adversarial search method on TDMs.
arXiv Detail & Related papers (2023-06-01T17:59:00Z) - Generate, Discriminate and Contrast: A Semi-Supervised Sentence
Representation Learning Framework [68.04940365847543]
We propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data.
Our method include three parts: 1) Generate: A generator/discriminator model is jointly trained to synthesize sentence pairs from open-domain unlabeled corpus; 2) Discriminate: Noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: A prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data.
arXiv Detail & Related papers (2022-10-30T10:15:21Z) - Warp Consistency for Unsupervised Learning of Dense Correspondences [116.56251250853488]
Key challenge in learning dense correspondences is lack of ground-truth matches for real image pairs.
We propose Warp Consistency, an unsupervised learning objective for dense correspondence regression.
Our approach sets a new state-of-the-art on several challenging benchmarks, including MegaDepth, RobotCar and TSS.
arXiv Detail & Related papers (2021-04-07T17:58:22Z) - A Minimalist Dataset for Systematic Generalization of Perception,
Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.