REVEAL: Retrieval-Augmented Visual-Language Pre-Training with
Multi-Source Multimodal Knowledge Memory
- URL: http://arxiv.org/abs/2212.05221v2
- Date: Mon, 3 Apr 2023 08:32:39 GMT
- Title: REVEAL: Retrieval-Augmented Visual-Language Pre-Training with
Multi-Source Multimodal Knowledge Memory
- Authors: Ziniu Hu and Ahmet Iscen and Chen Sun and Zirui Wang and Kai-Wei Chang
and Yizhou Sun and Cordelia Schmid and David A. Ross and Alireza Fathi
- Abstract summary: We propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL).
It learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries.
A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data.
- Score: 119.98011559193574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose an end-to-end Retrieval-Augmented Visual Language
Model (REVEAL) that learns to encode world knowledge into a large-scale memory,
and to retrieve from it to answer knowledge-intensive queries. REVEAL consists
of four key components: the memory, the encoder, the retriever and the
generator. The large-scale memory encodes various sources of multimodal world
knowledge (e.g., image-text pairs, question-answering pairs, knowledge graph triplets, etc.) via a unified encoder. The retriever finds the most relevant
knowledge entries in the memory, and the generator fuses the retrieved
knowledge with the input query to produce the output. A key novelty in our
approach is that the memory, encoder, retriever and generator are all
pre-trained end-to-end on a massive amount of data. Furthermore, our approach
can use a diverse set of multimodal knowledge sources, which is shown to result
in significant gains. We show that REVEAL achieves state-of-the-art results on
visual question answering and image captioning.
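The four-component pipeline can be pictured end to end. Below is a minimal NumPy sketch, our own illustration rather than code from the paper, with made-up sizes and a toy dot-product retriever: a query embedding scores a unified key-value memory, and the top entries are fused back into the query before generation. The softmax over retrieval scores is what would let a downstream generator loss backpropagate into the retriever, enabling the end-to-end pre-training the paper describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real REVEAL memory holds millions of entries.
num_entries, dim, top_k = 1000, 64, 4

# Every knowledge item (image-text pair, QA pair, KG triplet, ...) is
# embedded by the unified encoder into a shared key/value space.
memory_keys = rng.standard_normal((num_entries, dim)).astype(np.float32)
memory_values = rng.standard_normal((num_entries, dim)).astype(np.float32)

def retrieve(query_emb, k):
    """Dot-product retrieval: score every memory key, keep the top-k."""
    scores = memory_keys @ query_emb                 # (num_entries,)
    top_idx = np.argpartition(-scores, k)[:k]
    return memory_values[top_idx], scores[top_idx]

def fuse(query_emb, values, scores):
    """Attention-style fusion: a softmax over retrieval scores weights the
    retrieved values, which are mixed back into the query representation."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return query_emb + weights @ values

query = rng.standard_normal(dim).astype(np.float32)  # encoded image + question
values, scores = retrieve(query, top_k)
print(fuse(query, values, scores).shape)             # (64,), fed to the generator
```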
Related papers
- MATTER: Memory-Augmented Transformer Using Heterogeneous Knowledge Sources [12.783393023641505]
We introduce an efficient memory-augmented transformer called MATTER.
MATTER retrieves and reads from both unstructured sources (paragraphs) and semi-structured sources (QA pairs) in the form of fixed-length neural memories.
We demonstrate that our model outperforms existing efficient retrieval-augmented models on popular QA benchmarks in terms of both accuracy and speed.
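As a rough picture of what fixed-length neural memories buy: sources of very different lengths and structures compress to the same shape, so reading cost is constant per source. The sketch below is our own toy construction (the cross-attention compressor and all sizes are assumptions, not MATTER's actual architecture).

```python
import numpy as np

rng = np.random.default_rng(0)
dim, mem_len = 32, 4  # every source compresses to mem_len vectors

# Stand-in for learned compression queries (trained parameters in a real model).
learned_queries = rng.standard_normal((mem_len, dim))

def compress(token_embs):
    """Pool a variable-length source into a fixed-length neural memory via
    cross-attention from mem_len learned query vectors."""
    attn = learned_queries @ token_embs.T                  # (mem_len, n_tokens)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ token_embs                               # (mem_len, dim)

paragraph = rng.standard_normal((120, dim))  # unstructured source
qa_pair = rng.standard_normal((18, dim))     # semi-structured source
memory = np.stack([compress(paragraph), compress(qa_pair)])
print(memory.shape)  # (2, 4, 32): uniform-cost reads over heterogeneous sources
```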
arXiv Detail & Related papers (2024-06-07T06:35:37Z)
- Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that equips the model to answer more general knowledge-based questions.
Specifically, a detector predicts relation phrases relevant to the image-question pair.
The optimal answer is then selected as the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
- Multi-Grained Knowledge Retrieval for End-to-End Task-Oriented Dialog [42.088274728084265]
Retrieving proper domain knowledge from an external database lies at the heart of end-to-end task-oriented dialog systems.
Most existing systems blend knowledge retrieval with response generation and optimize them with direct supervision from reference responses.
We propose to decouple knowledge retrieval from response generation and introduce a multi-grained knowledge retriever.
arXiv Detail & Related papers (2023-05-17T12:12:46Z)
- Active Retrieval Augmented Generation [123.68874416084499]
Augmenting large language models (LMs) with information retrieved from external knowledge resources is one promising way to reduce factual errors.
Most existing retrieval-augmented LMs employ a retrieve-and-generate setup that retrieves information only once, based on the input.
We propose Forward-Looking Active REtrieval augmented generation (FLARE), a generic method which iteratively uses a prediction of the upcoming sentence to anticipate future content.
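A hedged sketch of that forward-looking loop, with toy stand-ins for the LM and the retriever (every class and method name here is hypothetical; the real method inspects token probabilities of a draft sentence rather than a random score):

```python
import random

random.seed(0)

class ToyLM:
    """Stand-in for a real LM: emits canned sentences with random confidence."""
    def generate_sentence(self, question, context, so_far):
        return f"sentence-{len(so_far)} (grounded in {len(context)} docs)"
    def confidence(self, sentence):
        return random.random()  # real FLARE uses token probabilities

class ToyRetriever:
    """Stand-in retriever: returns pretend documents for a query."""
    def search(self, query):
        return [f"doc for: {query[:24]}"]

def flare_answer(question, lm, retriever, threshold=0.6, max_sents=4):
    output = []
    context = retriever.search(question)  # one initial retrieval
    for _ in range(max_sents):
        # Draft the upcoming sentence as a look-ahead query.
        draft = lm.generate_sentence(question, context, output)
        if lm.confidence(draft) < threshold:
            # Low confidence: retrieve with the draft, then regenerate.
            context = retriever.search(draft)
            draft = lm.generate_sentence(question, context, output)
        output.append(draft)
    return " ".join(output)

print(flare_answer("When was the Brooklyn Bridge opened?", ToyLM(), ToyRetriever()))
```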
arXiv Detail & Related papers (2023-05-11T17:13:40Z)
- An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks [40.81306982129298]
Parametric and retrieval-augmented models have complementary strengths in terms of computational efficiency and predictive accuracy.
We propose the Efficient Memory-Augmented Transformer (EMAT), which encodes external knowledge into a key-value memory and exploits fast maximum inner product search for memory querying.
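A minimal illustration of that key-value lookup, with exact search shown for clarity; the speed claim rests on serving this with fast maximum inner product search, e.g. an approximate index. All names and sizes below are our own assumptions, not EMAT's code.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs, dim = 5000, 48  # toy sizes

# Key-value memory: keys encode knowledge queries, values encode their answers.
keys = rng.standard_normal((num_pairs, dim)).astype(np.float32)
values = rng.standard_normal((num_pairs, dim)).astype(np.float32)

def query_memory(q, k=3):
    """Maximum inner product search, done exactly here; a production system
    would use an approximate MIPS index for sub-linear lookup."""
    scores = keys @ q
    top = np.argpartition(-scores, k)[:k]
    return values[top]  # value vectors for the model to attend over

q = rng.standard_normal(dim).astype(np.float32)
print(query_memory(q).shape)  # (3, 48)
```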
arXiv Detail & Related papers (2022-10-30T08:34:49Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on two open question answering datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead): it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
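The pattern reduces to two LM calls. A minimal sketch, with a stub `complete` function standing in for any real LLM API (the prompts and the stub are our own, not the paper's):

```python
def complete(prompt):
    """Hypothetical stand-in for an LLM text-completion call."""
    return "stub completion for: " + prompt[:40]

def genread(question, n_docs=2):
    # Step 1 (generate): have the LM write contextual documents itself,
    # instead of retrieving them from an external corpus.
    docs = [complete(f"Generate a background document to answer: {question}")
            for _ in range(n_docs)]
    # Step 2 (read): condition the final answer on the generated documents.
    context = "\n\n".join(docs)
    return complete(f"{context}\n\nQuestion: {question}\nAnswer:")

print(genread("In what year did the first person walk on the moon?"))
```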
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
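A toy version of the kNN-augmented step; this is our own simplification (a cosine kNN over a random corpus and a fixed residual mix) rather than the paper's learned attention layer or its visual-similarity retriever.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus_size, dim, k = 2000, 32, 5

# External corpus of caption-token embeddings; rows unit-normalized
# so dot products are cosine similarities.
corpus = rng.standard_normal((corpus_size, dim)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

def knn_augmented_attention(h):
    """Mix the decoder state h with its k nearest external memories,
    a stand-in for a kNN-augmented attention layer."""
    hn = h / np.linalg.norm(h)
    sims = corpus @ hn
    idx = np.argpartition(-sims, k)[:k]
    w = np.exp(sims[idx])
    w /= w.sum()                               # softmax over neighbors
    return 0.5 * h + 0.5 * (w @ corpus[idx])   # residual mix of retrieved knowledge

state = rng.standard_normal(dim).astype(np.float32)
print(knn_augmented_attention(state).shape)  # (32,)
```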
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Mention Memory: incorporating textual knowledge into Transformers through entity mention attention [21.361822569279003]
We propose to integrate a semi-parametric representation of a large text corpus into a Transformer model as a source of factual knowledge.
The proposed model, TOME, is a Transformer that accesses this information through internal memory layers, in which each entity mention in the input passage attends to the mention memory.
In experiments using a memory of 150 million Wikipedia mentions, TOME achieves strong performance on several open-domain knowledge-intensive tasks.
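Schematically, each mention representation is updated by attending over a large table of precomputed mention encodings. A toy rendering with a tiny memory follows (all sizes and the update rule are assumptions; the real table holds roughly 150 million entries).

```python
import numpy as np

rng = np.random.default_rng(0)
mem_size, dim = 10_000, 24  # the real memory holds ~150M Wikipedia mentions

mention_memory = rng.standard_normal((mem_size, dim)).astype(np.float32)

def mention_attend(mention_vec, k=8):
    """Sparse top-k attention from one input mention over the mention memory;
    the retrieved mixture is added back into the token stream."""
    scores = mention_memory @ mention_vec
    idx = np.argpartition(-scores, k)[:k]
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()
    return mention_vec + w @ mention_memory[idx]

mentions = rng.standard_normal((3, dim)).astype(np.float32)  # mentions in a passage
print(np.stack([mention_attend(m) for m in mentions]).shape)  # (3, 24)
```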
arXiv Detail & Related papers (2021-10-12T17:19:05Z)
- Towards a Universal Continuous Knowledge Base [49.95342223987143]
We propose a method for building a continuous knowledge base that can store knowledge imported from multiple neural networks.
The knowledge from multiple models is imported into the knowledge base, from which the fused knowledge can be exported back to a single model.
Experiments on text classification show promising results.
arXiv Detail & Related papers (2020-12-25T12:27:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.