A survey on knowledge-enhanced multimodal learning
- URL: http://arxiv.org/abs/2211.12328v3
- Date: Sat, 23 Mar 2024 08:48:14 GMT
- Title: A survey on knowledge-enhanced multimodal learning
- Authors: Maria Lymperaiou, Giorgos Stamou
- Abstract summary: Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text.
VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other.
- Score: 1.8591405259852054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers, so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps can be identified: the limited comprehension of commonsense, factual, temporal and other everyday knowledge aspects calls into question the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
Related papers
- VL-ICL Bench: The Devil in the Details of Benchmarking Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL).
This study introduces a benchmark, VL-ICL Bench, for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z)
- RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of multi-modal multi-task visual understanding foundation models (MM-VUFMs) specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive zero-shot learning (ZSL) framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Retrieval-based Knowledge Augmented Vision Language Pre-training [9.779887832992435]
A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data.
arXiv Detail & Related papers (2023-04-27T02:23:47Z)
- The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges [0.0]
Current datasets used for visiolinguistic (VL) pre-training only contain a limited amount of visual and linguistic knowledge.
External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps.
arXiv Detail & Related papers (2023-03-04T13:12:18Z)
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
- Teaching Structured Vision&Language Concepts to Vision&Language Models [46.344585368641006]
We introduce the collective notion of Structured Vision&Language Concepts (SVLC).
SVLC includes object attributes, relations, and states which are present in the text and visible in the image.
We propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs.
arXiv Detail & Related papers (2022-11-21T18:54:10Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)