Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
- URL: http://arxiv.org/abs/2109.11321v1
- Date: Thu, 23 Sep 2021 12:11:23 GMT
- Title: Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
- Authors: Tobias Norlund, Lovisa Hagström, Richard Johansson
- Abstract summary: We propose a method for evaluating visual knowledge transfer to text for uni- or multimodal language models.
We find that our method can successfully be used to measure visual knowledge transfer capabilities in models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complement the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps: 1) a novel task querying for knowledge of memory colors, i.e., typical colors of well-known objects, and 2) filtering of model training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can successfully be used to measure visual knowledge transfer capabilities in models and that our novel model architecture shows promising results for leveraging multimodal knowledge in a unimodal setting.
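To make the first step concrete, below is a minimal, hypothetical sketch of a memory-color probe using a HuggingFace fill-mask pipeline. The object-color pairs, the prompt template, and the choice of bert-base-uncased are illustrative assumptions, not the paper's actual benchmark, model, or training-data filtering procedure.

```python
# Hypothetical sketch of a memory-color probe: query a masked language model
# for the typical color of well-known objects and check the top predictions.
# The object/color list and prompt wording are assumptions for illustration.
from transformers import pipeline

memory_colors = {
    "banana": "yellow",
    "grass": "green",
    "snow": "white",
    "blood": "red",
}

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

correct = 0
for obj, expected in memory_colors.items():
    # Cloze-style query for the object's typical (memory) color.
    preds = fill_mask(f"The color of {obj} is [MASK].", top_k=5)
    top_tokens = [p["token_str"].strip().lower() for p in preds]
    if expected in top_tokens:
        correct += 1
    print(f"{obj}: expected {expected}, predicted {top_tokens}")

print(f"Memory-color accuracy (top-5): {correct / len(memory_colors):.2f}")
```

In the paper's full setup, the second step additionally filters the model's training data so that knowledge acquired from visual data can be separated from knowledge acquired from text.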
Related papers
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning [10.839645156881573]
We introduce a novel semi-structured prompting approach that seamlessly integrates the model's parametric memory with unstructured knowledge from text documents and structured knowledge from knowledge graphs.
Experimental results on open-domain multi-hop question answering datasets demonstrate that our prompting method significantly surpasses existing techniques.
arXiv Detail & Related papers (2023-11-14T19:53:53Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- What do Models Learn From Training on More Than Text? Measuring Visual Commonsense Knowledge [0.13706331473063876]
We introduce two evaluation tasks for measuring visual commonsense knowledge in language models.
We find that the visual commonsense knowledge is not significantly different between the multimodal models and unimodal baseline models trained on text data.
arXiv Detail & Related papers (2022-05-14T13:37:50Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
arXiv Detail & Related papers (2022-04-20T04:47:01Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.