A survey on knowledge-enhanced multimodal learning
- URL: http://arxiv.org/abs/2211.12328v3
- Date: Sat, 23 Mar 2024 08:48:14 GMT
- Title: A survey on knowledge-enhanced multimodal learning
- Authors: Maria Lymperaiou, Giorgos Stamou
- Abstract summary: Multimodal learning has been a field of increasing interest, aiming to combine various modalities into a single joint representation.
Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text.
VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other.
- Score: 1.8591405259852054
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal learning has been a field of increasing interest, aiming to combine various modalities into a single joint representation. Especially in the area of visiolinguistic (VL) learning, multiple models and techniques have been developed, targeting a variety of tasks that involve images and text. VL models have reached unprecedented performance by extending the idea of Transformers so that both modalities can learn from each other. Massive pre-training procedures enable VL models to acquire a certain level of real-world understanding, although many gaps remain: their limited comprehension of commonsense, factual, temporal and other everyday knowledge calls into question the extendability of VL tasks. Knowledge graphs and other knowledge sources can fill those gaps by explicitly providing missing information, unlocking novel capabilities of VL models. At the same time, knowledge graphs enhance the explainability, fairness and validity of decision making, issues of utmost importance for such complex implementations. The current survey aims to unify the fields of VL representation learning and knowledge graphs, and provides a taxonomy and analysis of knowledge-enhanced VL models.
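To make the idea concrete, below is a minimal, illustrative sketch (PyTorch-style Python) of the kind of knowledge-enhanced fusion the survey taxonomizes: image and text features are projected into a shared space and attend over embeddings of retrieved knowledge-graph facts. The class, function, and parameter names (e.g. `KnowledgeFusion`) are hypothetical, not taken from any specific model covered in the survey.

```python
# Minimal sketch: fusing image, text, and knowledge-graph embeddings into one
# joint representation. All module and dimension choices are illustrative.
import torch
import torch.nn as nn


class KnowledgeFusion(nn.Module):
    """Projects image, text, and KG-fact embeddings into a shared space
    and fuses them with a single cross-attention step."""

    def __init__(self, img_dim: int, txt_dim: int, kg_dim: int, hidden: int = 256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.kg_proj = nn.Linear(kg_dim, hidden)
        # The VL tokens attend over the retrieved knowledge facts.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.out = nn.Linear(hidden, hidden)

    def forward(self, img_emb, txt_emb, kg_emb):
        # img_emb: (B, N_img, img_dim), txt_emb: (B, N_txt, txt_dim),
        # kg_emb: (B, N_facts, kg_dim) -- embeddings of retrieved KG triples.
        vl_tokens = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=1)
        kg_tokens = self.kg_proj(kg_emb)
        fused, _ = self.cross_attn(query=vl_tokens, key=kg_tokens, value=kg_tokens)
        return self.out(fused + vl_tokens)  # residual keeps the original VL signal


if __name__ == "__main__":
    fusion = KnowledgeFusion(img_dim=512, txt_dim=768, kg_dim=200)
    img = torch.randn(2, 50, 512)   # e.g. patch features from an image encoder
    txt = torch.randn(2, 20, 768)   # e.g. token features from a text encoder
    kg = torch.randn(2, 8, 200)     # e.g. TransE-style embeddings of 8 retrieved facts
    print(fusion(img, txt, kg).shape)  # torch.Size([2, 70, 256])
```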
Related papers
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages: Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks [48.67062958311173]
VL-GLUE is a multitask benchmark for natural language understanding.
We show that this benchmark is quite challenging for existing large-scale vision-language models.
arXiv Detail & Related papers (2024-10-17T15:27:17Z)
- VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL).
This study introduces VL-ICL Bench, a benchmark for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z)
- A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering [53.70661720114377]
Multimodal large models (MLMs) have significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA).
Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate deep comprehension of the visual information in conjunction with a vast repository of learned knowledge.
To uncover such capabilities, we provide an in-depth evaluation from three perspectives, including: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect them to general knowledge; and 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images.
arXiv Detail & Related papers (2023-11-13T18:22:32Z)
- Retrieval-based Knowledge Augmented Vision Language Pre-training [9.779887832992435]
A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data (a minimal sketch of such fact retrieval follows this entry).
arXiv Detail & Related papers (2023-04-27T02:23:47Z)
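As a concrete illustration of the retrieval step that such knowledge-augmented pre-training relies on, here is a minimal sketch in Python: knowledge facts are scored against a fused image-text query embedding by cosine similarity and the top-k are kept. The fact store, embeddings, and function names are placeholders for the sketch, not the actual REAVL implementation.

```python
# Minimal sketch of retrieval-based knowledge augmentation: given a fused
# image-text query embedding, return the top-k most similar knowledge facts.
import torch
import torch.nn.functional as F


def retrieve_facts(query_emb: torch.Tensor,
                   fact_embs: torch.Tensor,
                   facts: list[str],
                   k: int = 5) -> list[str]:
    """query_emb: (D,) embedding of the image-text pair.
    fact_embs: (N, D) pre-computed embeddings of N knowledge facts."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), fact_embs, dim=-1)  # (N,)
    top = sims.topk(min(k, len(facts))).indices
    return [facts[i] for i in top.tolist()]


if __name__ == "__main__":
    torch.manual_seed(0)
    facts = [f"fact {i}" for i in range(100)]   # e.g. linearized KG triples
    fact_embs = torch.randn(100, 64)            # pre-computed fact embeddings
    query_emb = torch.randn(64)                 # fused image-text embedding
    print(retrieve_facts(query_emb, fact_embs, facts, k=3))
```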
- The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges [0.0]
Current datasets used for visiolinguistic (VL) pre-training only contain a limited amount of visual and linguistic knowledge.
External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps.
arXiv Detail & Related papers (2023-03-04T13:12:18Z)
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE); a minimal linearization sketch follows this entry.
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
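For illustration, here is a minimal Python sketch of knowledge-graph linearization in the spirit of DANCE's data augmentation: (subject, relation, object) triples are turned into natural-language sentences that can be paired with images as extra training text. The relation templates below are assumptions made for the sketch, not the paper's actual templates.

```python
# Minimal sketch of KG linearization for data augmentation: each triple is
# rendered as a sentence via a relation-specific template. Templates here
# are illustrative, not taken from the DANCE paper.

TEMPLATES = {
    "IsA": "{s} is a kind of {o}.",
    "UsedFor": "{s} is used for {o}.",
    "CapableOf": "{s} can {o}.",
    "AtLocation": "{s} is typically found at {o}.",
}


def linearize(triple: tuple[str, str, str]) -> str:
    s, r, o = triple
    template = TEMPLATES.get(r, "{s} " + r + " {o}.")  # fall back to the raw relation
    return template.format(s=s, o=o)


if __name__ == "__main__":
    triples = [
        ("dog", "IsA", "animal"),
        ("umbrella", "UsedFor", "staying dry in the rain"),
        ("kitchen", "AtLocation", "a house"),
    ]
    # Each sentence could be attached to images whose detected objects match the subject.
    for t in triples:
        print(linearize(t))
```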
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)