The Contribution of Knowledge in Visiolinguistic Learning: A Survey on
Tasks and Challenges
- URL: http://arxiv.org/abs/2303.02411v1
- Date: Sat, 4 Mar 2023 13:12:18 GMT
- Title: The Contribution of Knowledge in Visiolinguistic Learning: A Survey on
Tasks and Challenges
- Authors: Maria Lymperaiou, Giorgos Stamou
- Abstract summary: Current datasets used for visiolinguistic (VL) pre-training only contain a limited amount of visual and linguistic knowledge.
External knowledge sources such as knowledge graphs (KGs) and Large Language Models (LLMs) are able to cover such generalization gaps.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advancements in visiolinguistic (VL) learning have allowed the
development of multiple models and techniques that offer several impressive
implementations, able to currently resolve a variety of tasks that require the
collaboration of vision and language. Current datasets used for VL pre-training
only contain a limited amount of visual and linguistic knowledge, thus
significantly limiting the generalization capabilities of many VL models.
External knowledge sources such as knowledge graphs (KGs) and Large Language
Models (LLMs) are able to cover such generalization gaps by filling in missing
knowledge, resulting in the emergence of hybrid architectures. In the current
survey, we analyze tasks that have benefited from such hybrid approaches.
Moreover, we categorize existing knowledge sources and types, proceeding to
discussion regarding the KG vs LLM dilemma and its potential impact to future
hybrid approaches.
Related papers
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Coding for Intelligence from the Perspective of Category [66.14012258680992]
Coding targets compressing and reconstructing data, and intelligence.
Recent trends demonstrate the potential homogeneity of these two fields.
We propose a novel problem of Coding for Intelligence from the category theory view.
arXiv Detail & Related papers (2024-07-01T07:05:44Z) - A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models [71.25225058845324]
Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation.
Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge.
RA-LLMs have emerged to harness external and authoritative knowledge bases, rather than relying on the model's internal knowledge.
arXiv Detail & Related papers (2024-05-10T02:48:45Z) - VL-ICL Bench: The Devil in the Details of Multimodal In-Context Learning [12.450293825734313]
Large language models (LLMs) famously exhibit emergent in-context learning (ICL)
This study introduces a benchmark VL-ICL Bench for multimodal in-context learning.
We evaluate the abilities of state-of-the-art VLLMs against this benchmark suite.
arXiv Detail & Related papers (2024-03-19T21:31:56Z) - RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z) - Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification [0.8232137862012223]
This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information.
To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors.
Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements.
arXiv Detail & Related papers (2024-03-18T18:08:44Z) - MechGPT, a language-based strategy for mechanics and materials modeling
that connects knowledge across scales, disciplines and modalities [0.0]
We use a Large Language Model (LLM) to distill question-answer pairs from raw sources followed by fine-tuning.
The resulting MechGPT LLM foundation model is used in a series of computational experiments to explore its capacity for knowledge retrieval, various language tasks, hypothesis generation, and connecting knowledge across disparate areas.
arXiv Detail & Related papers (2023-10-16T14:29:35Z) - A survey on knowledge-enhanced multimodal learning [1.8591405259852054]
Multimodal learning has been a field of increasing interest, aiming to combine various modalities in a single joint representation.
Especially in the area of visiolinguistic (VL) learning multiple models and techniques have been developed, targeting a variety of tasks that involve images and text.
VL models have reached unprecedented performances by extending the idea of Transformers, so that both modalities can learn from each other.
arXiv Detail & Related papers (2022-11-19T14:00:50Z) - Reasoning over Vision and Language: Exploring the Benefits of
Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.