The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
- URL: http://arxiv.org/abs/2310.02777v1
- Date: Wed, 4 Oct 2023 12:48:33 GMT
- Title: The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
- Authors: Chenwei Wu, Li Erran Li, Stefano Ermon, Patrick Haffner, Rong Ge, Zaiwei Zhang
- Abstract summary: We identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts.
We propose a new metric for compositionality without such linguistic priors.
- Score: 64.43764443000003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compositionality is a common property of many modalities, including natural languages and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image. We also propose a new metric for compositionality without such linguistic priors.
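The decomposition the abstract describes can be illustrated with a minimal sketch. This is our illustration under stated assumptions, not the paper's exact metric: `prior_free_margin` and its inputs are hypothetical names for a VLM's image-conditioned caption scores and a text-only language-model prior's scores on a (true caption, compositionally perturbed caption) pair.

```python
# Hedged sketch (illustrative, not the paper's exact metric): decompose a
# VLM's preference for a true caption over a compositionally perturbed one
# into a text-only linguistic prior and an image-grounded remainder.

def prior_free_margin(image_text_true, image_text_swap,
                      text_only_true, text_only_swap):
    """Preference margin on a (true, perturbed) caption pair after removing
    the part already achievable by a text-only prior with no image."""
    raw_margin = image_text_true - image_text_swap    # what benchmarks usually report
    prior_margin = text_only_true - text_only_swap    # recoverable without the image
    return raw_margin - prior_margin                  # credit only image-grounded signal

# Toy numbers: the VLM prefers the true caption by 0.30, but a text-only
# prior alone already prefers it by 0.25, so only 0.05 is image-grounded.
print(round(prior_free_margin(0.80, 0.50, 0.70, 0.45), 2))  # -> 0.05
```

A benchmark margin near the prior-only margin suggests the model is "solving" the perturbation from text statistics alone, which is the failure mode the paper's prior-free metric is designed to expose.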
Related papers
- Analyzing The Language of Visual Tokens [48.62180485759458]
We take a natural-language-centric approach to analyzing discrete visual languages.
We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts.
We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
arXiv Detail & Related papers (2024-11-07T18:59:28Z)
- On Evaluating Multilingual Compositional Generalization with Translated Datasets [34.51457321680049]
We show that compositional generalization abilities differ across languages.
We craft a faithful rule-based translation of the MCWQ dataset from English to Chinese and Japanese.
Even with the resulting robust benchmark, which we call MCWQ-R, we show that the distribution of compositions still suffers due to linguistic divergences.
arXiv Detail & Related papers (2023-06-20T10:03:57Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
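The population comparison described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released code: the toy 2-d vectors stand in for image embeddings (e.g. from an image encoder), and `coverage_score` is a hypothetical name.

```python
# Hedged sketch of a CoCo-CroLa-style comparison (illustrative reconstruction,
# not the authors' released code): score how well images generated for a
# translated noun match those generated for the source-language noun, via
# cosine similarity of image embeddings.
import numpy as np

def coverage_score(src_embs, tgt_embs):
    """Mean cosine similarity of target-language image embeddings to the
    centroid of the source-language image embeddings for the same noun."""
    centroid = src_embs.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    tgt_norm = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    return float((tgt_norm @ centroid).mean())

# Toy 2-d "embeddings": a faithful translation yields images near the source
# population; a dropped concept yields unrelated images and a lower score.
src = np.array([[1.0, 0.0], [0.9, 0.1]])
faithful = np.array([[1.0, 0.05]])
dropped = np.array([[0.0, 1.0]])
print(coverage_score(src, faithful) > coverage_score(src, dropped))  # -> True
```

A large gap between the source-language score and a target-language score flags a tangible noun whose concept the model fails to carry across languages.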
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding [143.5927158318524]
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We argue that the structured semantics inherent in videos and language is the crucial factor in achieving compositional generalization.
arXiv Detail & Related papers (2023-01-22T08:02:23Z)
- The Role of Syntactic Planning in Compositional Image Captioning [17.363891408746298]
In this work, we investigate methods to improve compositional generalization by planning the syntactic structure of a caption.
Our experiments show that jointly modeling tokens and syntactic tags improves generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics.
arXiv Detail & Related papers (2021-01-28T10:26:08Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Compositionality and Generalization in Emergent Languages [42.68870559695238]
We study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations.
We find no correlation between the degree of compositionality of an emergent language and its ability to generalize.
The more compositional a language is, the more easily it will be picked up by new learners.
arXiv Detail & Related papers (2020-04-20T08:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.