Semantic Composition in Visually Grounded Language Models
- URL: http://arxiv.org/abs/2305.16328v1
- Date: Mon, 15 May 2023 03:19:42 GMT
- Title: Semantic Composition in Visually Grounded Language Models
- Authors: Rohan Pandey
- Abstract summary: We show that visually-grounded language models drastically fail to represent compositional structure.
We introduce WinogroundVQA, a new compositional visual question answering benchmark.
We discuss connections of our work to neuroscience, psycholinguistics, formal semantics, and philosophy.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: What is sentence meaning and its ideal representation? Much of the expressive
power of human language derives from semantic composition, the mind's ability
to represent meaning hierarchically & relationally over constituents. At the
same time, much sentential meaning is outside the text and requires grounding
in sensory, motor, and experiential modalities to be adequately learned.
Although large language models display considerable compositional ability,
recent work shows that visually-grounded language models drastically fail to
represent compositional structure. In this thesis, we explore whether & how
models compose visually grounded semantics, and how we might improve their
ability to do so.
Specifically, we introduce 1) WinogroundVQA, a new compositional visual
question answering benchmark, 2) Syntactic Neural Module Distillation, a
measure of compositional ability in sentence embedding models, 3) Causal
Tracing for Image Captioning Models to locate neural representations vital for
vision-language composition, 4) Syntactic MeanPool to inject a compositional
inductive bias into sentence embeddings, and 5) Cross-modal Attention
Congruence Regularization, a self-supervised objective function for
vision-language relation alignment. We close by discussing connections of our
work to neuroscience, psycholinguistics, formal semantics, and philosophy.
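The abstract names its methods without further detail. As a rough illustration of item 4, the sketch below shows what syntax-guided mean pooling could look like: instead of averaging all token embeddings at once, constituents are averaged recursively along a parse tree. The tree encoding, function name, and plain averaging at each node are illustrative assumptions, not the thesis's actual formulation.

```python
# Minimal sketch of syntax-guided mean pooling (assumptions noted above):
# leaves are token indices; internal nodes are lists of child subtrees.
import numpy as np

def syntactic_meanpool(node, token_embs):
    """Recursively mean-pool constituent embeddings along a parse tree."""
    if isinstance(node, int):               # leaf: a token position
        return token_embs[node]
    children = [syntactic_meanpool(child, token_embs) for child in node]
    return np.mean(children, axis=0)        # compose siblings by averaging

# "the dog chased the cat", parsed as ((the dog) (chased (the cat)))
token_embs = np.random.randn(5, 8)
tree = [[0, 1], [2, [3, 4]]]
sentence_vec = syntactic_meanpool(tree, token_embs)
flat_vec = token_embs.mean(axis=0)          # flat baseline, ignores structure
```

Unlike flat pooling, tokens nested deeper in the tree receive less weight in the final vector, so the sentence embedding becomes sensitive to constituent structure; this is one simple way a compositional inductive bias could be injected.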
Related papers
- A Complexity-Based Theory of Compositionality [53.025566128892066]
In AI, compositional representations can enable a powerful form of out-of-distribution generalization.
Here, we propose a formal definition of compositionality that accounts for and extends our intuitions about compositionality.
The definition is conceptually simple, quantitative, grounded in algorithmic information theory, and applicable to any representation.
arXiv Detail & Related papers (2024-10-18T18:37:27Z)
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
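As a hedged illustration of the geometric idea behind the entry above (not the paper's actual Compositional Entailment Learning objective), the sketch below scores image-text pairs by Poincaré-ball distance and adds a crude entailment-style penalty that pushes the more general member of each pair, here assumed to be the text, toward the origin, where hyperbolic geometry places broad concepts. All function names and the loss form are assumptions.

```python
# Illustrative sketch only: hyperbolic image-text matching with a simple
# norm-based entailment regularizer. Embeddings must lie inside the unit ball.
import torch

def poincare_distance(u, v, eps=1e-6):
    """Geodesic distance in the Poincare ball, computed row-wise."""
    sq_dist = torch.sum((u - v) ** 2, dim=-1)
    denom = (1 - torch.sum(u * u, dim=-1)) * (1 - torch.sum(v * v, dim=-1))
    return torch.acosh(1 + 2 * sq_dist / (denom + eps))

def hyperbolic_alignment_loss(img_emb, txt_emb, margin=0.1):
    """img_emb, txt_emb: (batch, dim) matched pairs inside the unit ball."""
    match = poincare_distance(img_emb, txt_emb).mean()  # pull pairs together
    # Assumed entailment term: the (more general) text should sit closer to
    # the origin than the (more specific) image it describes.
    entail = torch.relu(txt_emb.norm(dim=-1) - img_emb.norm(dim=-1) + margin)
    return match + entail.mean()
```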
- HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take inspiration from human perception and explore a compositional approach to egocentric video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z)
- Contextualized word senses: from attention to compositionality [0.10878040851637999]
We propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words.
Particular attention is given to dependency relations and semantic notions such as selectional preferences and paradigmatic classes.
arXiv Detail & Related papers (2023-12-01T16:04:00Z)
- DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aimed at benchmarking machines' capabilities in pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z)
- Im-Promptu: In-Context Composition from Image Prompts [10.079743487034762]
We investigate whether analogical reasoning can enable in-context composition over composable elements of visual stimuli.
We use Im-Promptu to train agents with different levels of compositionality, including vector representations, patch representations, and object slots.
Our experiments reveal tradeoffs between extrapolation ability and the degree of compositionality, with non-compositional representations extending learned composition rules to unseen domains but performing poorly on combinatorial tasks.
arXiv Detail & Related papers (2023-05-26T21:10:11Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
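The two-stream model above aligns visual and language representations contrastively on MS COCO. A minimal sketch of a symmetric cross-modal contrastive (InfoNCE) objective of that general kind follows; the cited model's details differ, and all names here are illustrative.

```python
# Sketch of a symmetric image-text contrastive (InfoNCE) loss; matched
# pairs share a row index, so targets are the diagonal of the logit matrix.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim); row i of each is a matched pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric: classify the right caption per image and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```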
- Compositional Processing Emerges in Neural Networks Solving Math Problems [100.80518350845668]
Recent progress in artificial neural networks has shown that when large models are trained on enough linguistic data, grammatical structure emerges in their representations.
We extend this work to the domain of mathematical reasoning, where it is possible to formulate precise hypotheses about how meanings should be composed.
Our work shows that neural networks are not only able to infer something about the structured relationships implicit in their training data, but can also deploy this knowledge to guide the composition of individual meanings into composite wholes.
arXiv Detail & Related papers (2021-05-19T07:24:42Z)
- Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs [106.15931418425906]
We present the first study focused on generating natural language rationales across several complex visual reasoning tasks.
We present RationaleVT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
arXiv Detail & Related papers (2020-10-15T05:08:56Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.