Im-Promptu: In-Context Composition from Image Prompts
- URL: http://arxiv.org/abs/2305.17262v3
- Date: Mon, 23 Oct 2023 00:45:49 GMT
- Title: Im-Promptu: In-Context Composition from Image Prompts
- Authors: Bhishma Dedhia, Michael Chang, Jake C. Snell, Thomas L. Griffiths,
Niraj K. Jha
- Abstract summary: We investigate whether analogical reasoning can enable in-context composition over composable elements of visual stimuli.
We use Im-Promptu to train agents with different levels of compositionality, including vector representations, patch representations, and object slots.
Our experiments reveal tradeoffs between extrapolation abilities and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks.
- Score: 10.079743487034762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are few-shot learners that can solve diverse tasks from
a handful of demonstrations. This implicit understanding of tasks suggests that
the attention mechanisms over word tokens may play a role in analogical
reasoning. In this work, we investigate whether analogical reasoning can enable
in-context composition over composable elements of visual stimuli. First, we
introduce a suite of three benchmarks to test the generalization properties of
a visual in-context learner. We formalize the notion of an analogy-based
in-context learner and use it to design a meta-learning framework called
Im-Promptu. Whereas the requisite token granularity for language is well
established, the appropriate compositional granularity for enabling in-context
generalization in visual stimuli is usually unspecified. To this end, we use
Im-Promptu to train multiple agents with different levels of compositionality,
including vector representations, patch representations, and object slots. Our
experiments reveal tradeoffs between extrapolation abilities and the degree of
compositionality, with non-compositional representations extending learned
composition rules to unseen domains but performing poorly on combinatorial
tasks. Patch-based representations require patches to contain entire objects
for robust extrapolation. At the same time, object-centric tokenizers coupled
with a cross-attention module generate consistent and high-fidelity solutions,
with these inductive biases being particularly crucial for compositional
generalization. Lastly, we demonstrate a use case of Im-Promptu as an intuitive
programming interface for image generation.
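The abstract's central architectural claim, that object-centric tokenizers coupled with a cross-attention module support compositional generalization, can be made concrete with a small sketch. The following is a minimal, hypothetical illustration of one visual analogy step (A : B :: C : ?) in which the object slots of a query image C cross-attend to the slots of the context pair (A, B). The class name, dimensions, and the training recipe mentioned in the comments are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of an analogy-based in-context learner over object slots,
# assuming slot tokens come from a pretrained object-centric encoder
# (e.g., slot attention). Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class AnalogyCrossAttention(nn.Module):
    """Completes A : B :: C : ? by cross-attending query slots to context slots."""

    def __init__(self, slot_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(slot_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(slot_dim, 4 * slot_dim),
            nn.GELU(),
            nn.Linear(4 * slot_dim, slot_dim),
        )
        self.norm1 = nn.LayerNorm(slot_dim)
        self.norm2 = nn.LayerNorm(slot_dim)

    def forward(self, slots_a, slots_b, slots_c):
        # The (A, B) pair implicitly encodes the transformation; the query
        # slots of C attend to it to infer their own transformed slots for D.
        context = torch.cat([slots_a, slots_b], dim=1)       # (B, 2*K, D)
        attended, _ = self.attn(slots_c, context, context)   # (B, K, D)
        x = self.norm1(slots_c + attended)
        return self.norm2(x + self.ffn(x))                   # predicted slots for D


# Usage with K = 6 slots of dimension 64 per image; in a full pipeline the
# predicted slots would be decoded to pixels and regressed against the
# ground-truth completion image D.
make_slots = lambda: torch.randn(8, 6, 64)
model = AnalogyCrossAttention()
pred_d = model(make_slots(), make_slots(), make_slots())  # (8, 6, 64)
```

Treating the (A, B) pair as an in-context prompt mirrors the language-model analogy the abstract draws: the attention pattern over context tokens, rather than any task-specific weights, carries the composition rule.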
Related papers
- IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning [94.52149969720712]
IntCoOp learns to jointly align attribute-level inductive biases and class embeddings during prompt-tuning.
IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
arXiv Detail & Related papers (2024-06-19T16:37:31Z)
- Contextualized word senses: from attention to compositionality [0.10878040851637999]
We propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words.
Particular attention is given to dependency relations and semantic notions such as selection preferences and paradigmatic classes.
arXiv Detail & Related papers (2023-12-01T16:04:00Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Semantic Composition in Visually Grounded Language Models [0.0]
We show that visually-grounded language models drastically fail to represent compositional structure.
We introduce WinogroundVQA, a new compositional visual question answering benchmark.
We discuss connections of our work to neuroscience, psycholinguistics, formal semantics, and philosophy.
arXiv Detail & Related papers (2023-05-15T03:19:42Z)
- Relate to Predict: Towards Task-Independent Knowledge Representations for Reinforcement Learning [11.245432408899092]
Reinforcement learning can enable agents to learn complex tasks.
However, the resulting knowledge is difficult to interpret and to reuse across tasks.
In this paper, we introduce an inductive bias for explicit object-centered knowledge separation.
We show that the degree of explicitness in knowledge separation correlates with faster learning, better accuracy, better generalization, and better interpretability.
arXiv Detail & Related papers (2022-12-10T13:33:56Z)
- Learning Attention Propagation for Compositional Zero-Shot Learning [71.55375561183523]
We propose a novel method called Compositional Attention Propagated Embedding (CAPE).
CAPE learns to identify the structure linking attributes to objects and propagates knowledge across compositions to learn class embeddings for all seen and unseen compositions.
We show that our method outperforms previous baselines to set a new state-of-the-art on three publicly available benchmarks.
arXiv Detail & Related papers (2022-10-20T19:44:11Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping [50.14703605659837]
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- On (Emergent) Systematic Generalisation and Compositionality in Visual Referential Games with Straight-Through Gumbel-Softmax Estimator [0.30458514384586394]
The drivers of compositionality emerge when two (or more) agents play a non-visual referential game.
This paper investigates to what extent the drivers of compositionality identified so far in the field apply in the ST-GS context.
Using the ST-GS approach with small batch sizes and an overcomplete communication channel improves compositionality in the emerging languages.
arXiv Detail & Related papers (2020-12-19T20:40:09Z)