Retrieval-based Knowledge Augmented Vision Language Pre-training
- URL: http://arxiv.org/abs/2304.13923v2
- Date: Sun, 6 Aug 2023 08:06:43 GMT
- Title: Retrieval-based Knowledge Augmented Vision Language Pre-training
- Authors: Jiahua Rao, Zifei Shan, Longpo Liu, Yao Zhou, Yuedong Yang
- Abstract summary: A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data.
- Score: 9.779887832992435
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the recent progress in large-scale vision and language representation
learning, Vision Language Pre-training (VLP) models have achieved promising
improvements on various multi-modal downstream tasks. Albeit powerful, these
models have not fully leveraged world knowledge to their advantage. A key
challenge of knowledge-augmented VLP is the lack of clear connections between
knowledge and multi-modal data. Moreover, not all knowledge present in
images/texts is useful; therefore, prior approaches often struggle to
effectively integrate knowledge, visual, and textual information. In this
study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL),
a novel knowledge-augmented pre-training framework to address the above issues.
For the first time, we introduce a knowledge-aware self-supervised learning
scheme that efficiently establishes the correspondence between knowledge and
multi-modal data and identifies informative knowledge to improve the modeling
of alignment and interactions between visual and textual modalities. By
adaptively integrating informative knowledge with visual and textual
information, REAVL achieves new state-of-the-art performance uniformly on
knowledge-based vision-language understanding and multi-modal entity linking
tasks, as well as competitive results on general vision-language tasks while
using only 0.2% of the pre-training data of the best models. Our model shows strong
sample efficiency and effective knowledge utilization.
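To make the retrieval-augmentation idea above concrete (retrieve knowledge entries that match an image-text pair and feed them to the multi-modal encoder together with the caption), here is a minimal Python sketch. The knowledge base, the toy hashing embedding, and the helper names (embed_text, retrieve_top_k) are illustrative assumptions, not REAVL's actual components, which rely on learned encoders and a knowledge-aware self-supervised objective.
```python
# Minimal sketch of retrieval-based knowledge augmentation for an image-text pair.
# All names and the toy embedding are illustrative; REAVL uses learned encoders and a
# knowledge-aware self-supervised objective rather than this bag-of-words heuristic.
import numpy as np

# Hypothetical knowledge base of textual facts (a stand-in for a real KB such as Wikidata).
KNOWLEDGE_BASE = [
    "A golden retriever is a breed of dog originally bred to retrieve waterfowl.",
    "A frisbee is a flying disc used in outdoor throwing games.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
]

def embed_text(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding; a real system would use a trained text encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve_top_k(caption: str, k: int = 2) -> list[str]:
    """Rank knowledge entries by cosine similarity to the caption and keep the top k."""
    q = embed_text(caption)
    return sorted(KNOWLEDGE_BASE, key=lambda fact: float(q @ embed_text(fact)), reverse=True)[:k]

caption = "A golden retriever catches a frisbee in the park."
facts = retrieve_top_k(caption, k=2)
# In the full framework, the retrieved facts and the image features would be fused by a
# multi-modal encoder; here we only show the knowledge-augmented text input.
augmented_input = caption + " [KNOWLEDGE] " + " ".join(facts)
print(augmented_input)
```
Only the retrieval step is sketched; the paper's contribution lies in how informative knowledge is identified and integrated during pre-training, not in the retrieval heuristic itself.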
Related papers
- Improving Contextual Congruence Across Modalities for Effective
Multimodal Marketing using Knowledge-infused Learning [3.3281180957341117]
Large Language Models (LLMs) and Large Vision Models (LVMs) are still limited in capturing holistic meaning through cross-modal semantic relationships.
We design a framework that couples explicit commonsense knowledge, in the form of knowledge graphs, with large VLMs to improve performance on a downstream task.
Our approach enables the early detection of likely persuasive multi-modal campaigns and the assessment and augmentation of marketing theory.
arXiv Detail & Related papers (2024-02-06T00:51:27Z) - Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
Propagation [68.13453771001522]
We propose a multimodal intensive zero-shot learning (ZSL) framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z) - Contrastive Language-Image Pre-Training with Knowledge Graphs [33.211811772961234]
We propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model.
Our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities.
arXiv Detail & Related papers (2022-10-17T09:49:22Z) - LM-CORE: Language Models with Contextually Relevant External Knowledge [13.451001884972033]
We argue that storing large amounts of knowledge in the model parameters is sub-optimal given the ever-growing amounts of knowledge and resource requirements.
We present LM-CORE, a general framework to achieve this, which allows decoupling of the language model training from the external knowledge source.
Experimental results show that LM-CORE, with access to external knowledge, significantly and robustly outperforms state-of-the-art knowledge-enhanced language models on knowledge probing tasks.
arXiv Detail & Related papers (2022-08-12T18:59:37Z) - K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge (a minimal sketch of this kind of enrichment appears after this list).
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z) - Leveraging Visual Knowledge in Language Tasks: An Empirical Study on
Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z) - Reasoning over Vision and Language: Exploring the Benefits of
Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z) - JAKET: Joint Pre-training of Knowledge Graph and Language Understanding [73.43768772121985]
We propose a novel joint pre-training framework, JAKET, to model both the knowledge graph and language.
The knowledge module and language module provide essential information to mutually assist each other.
Our design enables the pre-trained model to easily adapt to unseen knowledge graphs in new domains.
arXiv Detail & Related papers (2020-10-02T05:53:36Z) - CoLAKE: Contextualized Language and Knowledge Embedding [81.90416952762803]
We propose Contextualized Language and Knowledge Embedding (CoLAKE).
CoLAKE jointly learns contextualized representations for both language and knowledge with an extended masked language modeling (MLM) objective.
We conduct experiments on knowledge-driven tasks, knowledge probing tasks, and language understanding tasks.
arXiv Detail & Related papers (2020-10-01T11:39:32Z)
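As referenced in the K-LITE entry above, external lexical knowledge can be used to enrich class-name prompts. The following is a minimal sketch of that idea using WordNet glosses via NLTK; the prompt template and the enrich_prompt helper are assumptions for illustration, not K-LITE's actual pipeline (which also draws on Wiktionary).
```python
# Minimal sketch of enriching a class-name prompt with a WordNet gloss, in the spirit of
# K-LITE-style knowledge augmentation; the template and helper name are illustrative.
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus

def enrich_prompt(class_name: str) -> str:
    """Append the first WordNet gloss of the class name to a CLIP-style prompt."""
    synsets = wordnet.synsets(class_name.replace(" ", "_"))
    prompt = f"a photo of a {class_name}"
    if synsets:
        prompt += f", which is {synsets[0].definition()}"
    return prompt

print(enrich_prompt("sea otter"))
# Expected shape of the output (the actual gloss depends on the WordNet version):
# "a photo of a sea otter, which is <gloss from WordNet>"
```
The enriched prompt is then encoded in place of the bare class name, so the text encoder sees a definition alongside the label.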