Finding Structural Knowledge in Multimodal-BERT
- URL: http://arxiv.org/abs/2203.09306v1
- Date: Thu, 17 Mar 2022 13:20:01 GMT
- Title: Finding Structural Knowledge in Multimodal-BERT
- Authors: Victor Milewski, Miryam de Lhoneux, Marie-Francine Moens
- Abstract summary: We make the inherent structure of language and visuals explicit by a dependency parse of the sentences that describe the image.
We call this explicit visual structure the scene tree, which is based on the dependency tree of the language description.
- Score: 18.469318775468754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate the knowledge learned in the embeddings of
multimodal-BERT models. More specifically, we probe their capabilities of
storing the grammatical structure of linguistic data and the structure learned
over objects in visual data. To reach that goal, we first make the inherent
structure of language and visuals explicit by a dependency parse of the
sentences that describe the image and by the dependencies between the object
regions in the image, respectively. We call this explicit visual structure the
\textit{scene tree}, which is based on the dependency tree of the language
description. Extensive probing experiments show that the multimodal-BERT models
do not encode these scene trees. Code is available at
\url{https://github.com/VSJMilewski/multimodal-probes}.
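As a rough illustration of what such a probing experiment looks like, the sketch below trains a Hewitt-and-Manning-style structural distance probe on frozen token embeddings: a linear map is learned so that squared distances in probe space approximate pairwise distances in a dependency (or scene) tree. The probe class, dimensions, toy parse, and training loop are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal sketch of a structural distance probe (Hewitt & Manning style),
# the family of probes this paper's experiments build on. All names, the
# toy parse, and the random embeddings are assumptions for illustration.
import torch
import torch.nn as nn


def tree_distances(heads):
    """Pairwise tree distances for a dependency parse.

    heads[i] is the index of token i's head, or -1 for the root.
    The distance between two tokens is the number of edges between them.
    """
    n = len(heads)
    adj = [[] for _ in range(n)]          # undirected tree adjacency
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = torch.full((n, n), float("inf"))
    for start in range(n):                # BFS from every token
        dist[start, start] = 0.0
        frontier = [start]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if dist[start, v] == float("inf"):
                        dist[start, v] = dist[start, u] + 1
                        nxt.append(v)
            frontier = nxt
    return dist


class DistanceProbe(nn.Module):
    """Linear map B; squared L2 distance in probe space should match tree distance."""

    def __init__(self, hidden_dim, probe_rank=128):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, probe_rank, bias=False)

    def forward(self, embeddings):                 # (n_tokens, hidden_dim)
        z = self.proj(embeddings)                  # (n_tokens, probe_rank)
        diff = z.unsqueeze(1) - z.unsqueeze(0)     # (n, n, probe_rank)
        return (diff ** 2).sum(-1)                 # predicted squared distances


# Toy usage: one caption with a hypothetical parse and random embeddings
# standing in for frozen multimodal-BERT token states.
heads = [1, -1, 3, 1]                      # e.g. "a dog on grass", "dog" as root
gold = tree_distances(heads)
emb = torch.randn(len(heads), 768)
probe = DistanceProbe(hidden_dim=768)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = (probe(emb) - gold).abs().mean()  # L1 loss on pairwise distances
    loss.backward()
    opt.step()
```

In the multimodal setting, the same recipe applies to object-region embeddings, with gold distances taken from the scene tree instead of the textual dependency tree.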
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
ARMADA extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Emergent Visual-Semantic Hierarchies in Image-Text Representations [13.300199242824934]
We study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies.
We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding.
arXiv Detail & Related papers (2024-07-11T14:09:42Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships [17.930724926012264]
We introduce a new task that targets inducing a joint vision-language structure in an unsupervised manner.
Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly.
We propose an automatic alignment procedure to produce coarse structures followed by human refinement to produce high-quality ones.
arXiv Detail & Related papers (2022-03-27T09:51:34Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Structured Multi-modal Feature Embedding and Alignment for Image-Sentence Retrieval [12.050958976545914]
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments.
We propose a novel Structured Multi-modal Feature Embedding and Alignment model for image-sentence retrieval.
In particular, the relations of the visual and textual fragments are modeled by constructing Visual Context-aware Structured Tree encoder (VCS-Tree) and Textual Context-aware Structured Tree encoder (TCS-Tree) with shared labels.
arXiv Detail & Related papers (2021-08-05T07:24:54Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.