Retrieval-based Knowledge Augmented Vision Language Pre-training
- URL: http://arxiv.org/abs/2304.13923v2
- Date: Sun, 6 Aug 2023 08:06:43 GMT
- Title: Retrieval-based Knowledge Augmented Vision Language Pre-training
- Authors: Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, Yuedong Yang
- Abstract summary: A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data.
- Score: 9.779887832992435
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the recent progress in large-scale vision and language representation
learning, Vision Language Pre-training (VLP) models have achieved promising
improvements on various multi-modal downstream tasks. Albeit powerful, these
models have not fully leveraged world knowledge to their advantage. A key
challenge of knowledge-augmented VLP is the lack of clear connections between
knowledge and multi-modal data. Moreover, not all knowledge present in
images/texts is useful; therefore, prior approaches often struggle to
effectively integrate knowledge, visual, and textual information. In this
study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL),
a novel knowledge-augmented pre-training framework to address the above issues.
For the first time, we introduce a knowledge-aware self-supervised learning
scheme that efficiently establishes the correspondence between knowledge and
multi-modal data and identifies informative knowledge to improve the modeling
of alignment and interactions between visual and textual modalities. By
adaptively integrating informative knowledge with visual and textual
information, REAVL achieves new state-of-the-art performance uniformly on
knowledge-based vision-language understanding and multi-modal entity linking
tasks, as well as competitive results on general vision-language tasks while
using only 0.2% of the pre-training data of the best models. Our model shows strong
sample efficiency and effective knowledge utilization.
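To make the retrieval-augmentation idea above concrete (retrieve knowledge entries that match an image-text pair and feed them to the multi-modal encoder together with the caption), here is a minimal Python sketch. The knowledge base, the toy hashing embedding, and the helper names (embed_text, retrieve_top_k) are illustrative assumptions, not REAVL's actual components, which rely on learned encoders and a knowledge-aware self-supervised objective.
```python
# Minimal sketch of retrieval-based knowledge augmentation for an image-text pair.
# All names and the toy embedding are illustrative; REAVL uses learned encoders and a
# knowledge-aware self-supervised objective rather than this bag-of-words heuristic.
import numpy as np

# Hypothetical knowledge base of textual facts (a stand-in for a real KB such as Wikidata).
KNOWLEDGE_BASE = [
    "A golden retriever is a breed of dog originally bred to retrieve waterfowl.",
    "A frisbee is a flying disc used in outdoor throwing games.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
]

def embed_text(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding; a real system would use a trained text encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_top_k(caption: str, k: int = 2) -> list[str]:
    """Rank knowledge entries by cosine similarity to the caption and keep the top k."""
    q = embed_text(caption)
    return sorted(KNOWLEDGE_BASE, key=lambda fact: float(q @ embed_text(fact)), reverse=True)[:k]

caption = "A golden retriever catches a frisbee in the park."
facts = retrieve_top_k(caption, k=2)
# In the full framework, the retrieved facts and the image features would be fused by a
# multi-modal encoder; here we only show the knowledge-augmented text input.
augmented_input = caption + " [KNOWLEDGE] " + " ".join(facts)
print(augmented_input)
```
Only the retrieval step is sketched; the paper's contribution lies in how informative knowledge is identified and integrated during pre-training, not in the retrieval heuristic itself.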
Related papers
- Improving Contextual Congruence Across Modalities for Effective
Multimodal Marketing using Knowledge-infused Learning [3.3281180957341117]
Large Language Models (LLMs) and Large Vision Models (LVMs) are still limited in capturing holistic meaning through cross-modal semantic relationships.
We design a framework that couples explicit commonsense knowledge, in the form of knowledge graphs, with large VLMs to improve performance on a downstream task.
Our approach enables the early detection of likely persuasive multi-modal campaigns and the assessment and augmentation of marketing theory.
arXiv Detail & Related papers (2024-02-06T00:51:27Z) - Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
Propagation [68.13453771001522]
We propose a multimodal intensive zero-shot learning (ZSL) framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z) - Contrastive Language-Image Pre-Training with Knowledge Graphs [33.211811772961234]
We propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model.
Our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities.
arXiv Detail & Related papers (2022-10-17T09:49:22Z) - LM-CORE: Language Models with Contextually Relevant External Knowledge [13.451001884972033]
We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements.
We present LM-CORE, a general framework to achieve this, which allows decoupling of the language model training from the external knowledge source.
Experimental results show that LM-CORE, with access to external knowledge, significantly and robustly outperforms state-of-the-art knowledge-enhanced language models on knowledge probing tasks.
arXiv Detail & Related papers (2022-08-12T18:59:37Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge (a minimal sketch of this kind of enrichment appears after this list).
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - Leveraging Visual Knowledge in Language Tasks: An Empirical Study on
Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z) - Reasoning over Vision and Language: Exploring the Benefits of
Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z) - JAKET: Joint Pre-training of Knowledge Graph and Language Understanding [73.43768772121985]
We propose a novel joint pre-training framework, JAKET, to model both the knowledge graph and language.
The knowledge module and language module provide essential information to mutually assist each other.
Our design enables the pre-trained model to easily adapt to unseen knowledge graphs in new domains.
arXiv Detail & Related papers (2020-10-02T05:53:36Z) - CoLAKE: Contextualized Language and Knowledge Embedding [81.90416952762803]
We propose Contextualized Language and Knowledge Embedding (CoLAKE).
CoLAKE jointly learns contextualized representations for both language and knowledge with an extended masked language modeling (MLM) objective.
We conduct experiments on knowledge-driven tasks, knowledge probing tasks, and language understanding tasks.
arXiv Detail & Related papers (2020-10-01T11:39:32Z)
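As referenced in the K-LITE entry above, external lexical knowledge can be used to enrich class-name prompts. The following is a minimal sketch of that idea using WordNet glosses via NLTK; the prompt template and the enrich_prompt helper are assumptions for illustration, not K-LITE's actual pipeline (which also draws on Wiktionary).
```python
# Minimal sketch of enriching a class-name prompt with a WordNet gloss, in the spirit of
# K-LITE-style knowledge augmentation; the template and helper name are illustrative.
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus

def enrich_prompt(class_name: str) -> str:
    """Append the first WordNet gloss of the class name to a CLIP-style prompt."""
    synsets = wordnet.synsets(class_name.replace(" ", "_"))
    prompt = f"a photo of a {class_name}"
    if synsets:
        prompt += f", which is {synsets[0].definition()}"
    return prompt

print(enrich_prompt("sea otter"))
# Expected shape of the output (the actual gloss depends on the WordNet version):
# "a photo of a sea otter, which is <gloss from WordNet>"
```
The enriched prompt is then encoded in place of the bare class name, so the text encoder sees a definition alongside the label.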