K-LITE: Learning Transferable Visual Models with External Knowledge
- URL: http://arxiv.org/abs/2204.09222v1
- Date: Wed, 20 Apr 2022 04:47:01 GMT
- Title: K-LITE: Learning Transferable Visual Models with External Knowledge
- Authors: Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang,
Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt
Keutzer, Trevor Darrell, and Jianfeng Gao
- Abstract summary: K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
- Score: 242.3887854728843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent state-of-the-art computer vision systems are trained from natural
language supervision, ranging from simple object category names to descriptive
captions. This free form of supervision ensures high generality and usability
of the learned visual models, relying on extensive heuristics in data
collection to cover as many visual concepts as possible. Alternatively,
learning with external knowledge about images is a promising approach that
leverages a much more structured source of supervision. In this paper, we
propose K-LITE
(Knowledge-augmented Language-Image Training and Evaluation), a simple strategy
to leverage external knowledge to build transferable visual systems: In
training, it enriches entities in natural language with WordNet and Wiktionary
knowledge, leading to an efficient and scalable approach to learning image
representations that can understand both visual concepts and their knowledge;
In evaluation, the natural language is also augmented with external knowledge
and then used to reference learned visual concepts (or describe new ones) to
enable zero-shot and few-shot transfer of the pre-trained models. We study the
performance of K-LITE on two important computer vision problems, image
classification and object detection, benchmarking on 20 and 13 different
existing datasets, respectively. The proposed knowledge-augmented models show
significant improvement in transfer learning performance over existing methods.
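To make the recipe above concrete, the following is a minimal sketch of
knowledge-enriched prompting for zero-shot classification, assuming NLTK's
WordNet interface and a generic CLIP-style dual encoder passed in as
encode_image / encode_text. The prompt template, the choice of the first
synset, and the encoder callables are illustrative assumptions, not K-LITE's
actual pipeline (which also draws on Wiktionary).

```python
# Sketch: enrich class names with WordNet glosses before text encoding.
# Assumptions (not from the paper): NLTK's WordNet corpus is installed
# (pip install nltk; nltk.download('wordnet')), and encode_image/encode_text
# are any CLIP-style encoders that return NumPy vectors.
import numpy as np
from nltk.corpus import wordnet as wn


def enrich_with_wordnet(class_name: str) -> str:
    """Append the first WordNet gloss of a class name to a plain prompt."""
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return f"a photo of a {class_name}."
    return f"a photo of a {class_name}, {synsets[0].definition()}."


def zero_shot_predict(image, class_names, encode_image, encode_text):
    """Score an image against knowledge-augmented prompts; return the best class."""
    img = encode_image(image)
    img = img / np.linalg.norm(img)
    best_name, best_score = None, float("-inf")
    for name in class_names:
        txt = encode_text(enrich_with_wordnet(name))
        score = float(img @ (txt / np.linalg.norm(txt)))  # cosine similarity
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

The same enrichment is applied on the evaluation side, so unseen categories can
be described to the model through their glosses rather than their names alone.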
Related papers
- A Vision Check-up for Language Models [61.852026871772914]
We show how a preliminary visual representation learning system can be trained using models of text.
Experiments on self-supervised visual representation learning highlight the potential to train vision models capable of making semantic assessments of natural images.
arXiv Detail & Related papers (2024-01-03T18:09:33Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence, where external knowledge is usually incorporated when recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models [24.456117679941816]
Contrastive Reading Model (Cream) is a novel neural architecture designed to enhance the language-image understanding capability of Large Language Models (LLMs).
Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants.
arXiv Detail & Related papers (2023-05-24T11:59:13Z)
- Retrieval-based Knowledge Augmented Vision Language Pre-training [9.779887832992435]
A key challenge of knowledge-augmented pre-training is the lack of clear connections between knowledge and multi-modal data.
In this study, we propose REtrieval-based knowledge Augmented Vision Language (REAVL), a novel knowledge-augmented pre-training framework.
For the first time, we introduce a knowledge-aware self-supervised learning scheme that efficiently establishes the correspondence between knowledge and multi-modal data.
arXiv Detail & Related papers (2023-04-27T02:23:47Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features effectively complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. A minimal sketch of the retrieval step appears after this list.
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
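As referenced in the Retrieval-Augmented Transformer for Image Captioning entry
above, the following is a minimal sketch of the kNN retrieval step only,
assuming a precomputed external memory of L2-normalised image embeddings
aligned with their captions. The memory layout and function names are
assumptions for illustration; the paper itself couples retrieval with a
differentiable encoder and a kNN-augmented attention layer.

```python
# Sketch: k-nearest-neighbour lookup over an external caption memory.
# Assumptions (not from the paper): memory_keys is an (N, d) NumPy matrix of
# L2-normalised image embeddings aligned with memory_captions (N strings).
import numpy as np


def knn_retrieve(query_embedding, memory_keys, memory_captions, k=5):
    """Return the k captions whose stored embeddings are most similar to the query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    sims = memory_keys @ q                  # cosine similarity against every entry
    top = np.argsort(-sims)[:k]             # indices of the k most similar entries
    return [memory_captions[i] for i in top]
```

In a captioning decoder, the retrieved captions would then be encoded and
attended over (the paper's kNN-augmented attention layer); in the simplest
setting they could just be prepended to the generation context.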
This list is automatically generated from the titles and abstracts of the papers on this site.