MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
Answering over Images and Text
- URL: http://arxiv.org/abs/2210.02928v1
- Date: Thu, 6 Oct 2022 13:58:03 GMT
- Title: MuRAG: Multimodal Retrieval-Augmented Generator for Open Question
Answering over Images and Text
- Authors: Wenhu Chen, Hexiang Hu, Xi Chen, Pat Verga, William W. Cohen
- Abstract summary: We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG)
MuRAG accesses an external non-parametric multimodal memory to augment language generation.
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
- Score: 58.655375327681774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While language models store a massive amount of world knowledge implicitly in
their parameters, even very large models often fail to encode information about
rare entities and events, while incurring huge computational costs. Recently,
retrieval-augmented models, such as REALM, RAG, and RETRO, have incorporated
world knowledge into language generation by leveraging an external
non-parametric index and have demonstrated impressive performance with
constrained model sizes. However, these methods are restricted to retrieving
only textual knowledge, neglecting the ubiquitous amount of knowledge in other
modalities like images -- much of which contains information not covered by any
text. To address this limitation, we propose the first Multimodal
Retrieval-Augmented Transformer (MuRAG), which accesses an external
non-parametric multimodal memory to augment language generation. MuRAG is
pre-trained with a mixture of large-scale image-text and text-only corpora
using a joint contrastive and generative loss. We perform experiments on two
different datasets that require retrieving and reasoning over both images and
text to answer a given query: WebQA, and MultimodalQA. Our results show that
MuRAG achieves state-of-the-art accuracy, outperforming existing models by
10-20% absolute on both datasets and under both distractor and full-wiki
settings.
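As a rough illustration of the retrieve-then-generate loop described in the abstract, the sketch below pairs a placeholder dense query encoder with a small mixed image/text memory and a stub generator. All function names, dimensions, and the in-memory index are illustrative assumptions, not MuRAG's actual implementation (the paper retrieves from a much larger non-parametric multimodal memory).

```python
# Minimal sketch of a MuRAG-style retrieve-then-generate step.
# The encoder, memory, and generator below are stand-ins, not the paper's models.
import numpy as np

def encode_query(question: str) -> np.ndarray:
    """Stand-in dense query encoder (deterministic random projection)."""
    rng = np.random.default_rng(abs(hash(question)) % (2**32))
    return rng.standard_normal(128)

def build_memory(items):
    """Precompute embeddings for a mixed memory of image captions and passages."""
    rng = np.random.default_rng(0)
    return [(item, rng.standard_normal(128)) for item in items]

def retrieve(query_vec, memory, k=2):
    """Maximum inner product search over the non-parametric memory."""
    scored = sorted(memory, key=lambda m: -float(query_vec @ m[1]))
    return [item for item, _ in scored[:k]]

def generate(question, retrieved):
    """Placeholder generator: conditions the answer on the question plus
    the retrieved multimodal evidence."""
    context = " ".join(retrieved)
    return f"[answer conditioned on: {context!r} | question: {question!r}]"

memory = build_memory([
    "IMAGE: photo of the Eiffel Tower at night",    # image entries would enter
    "TEXT: The Eiffel Tower is 330 metres tall.",   # the memory as embeddings
])
q = "How tall is the Eiffel Tower?"
print(generate(q, retrieve(encode_query(q), memory)))
```

The abstract also notes that pre-training combines a contrastive and a generative loss; the sketch covers only the inference-time retrieval step, not training.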
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
The framework extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Harmonizing Visual Text Comprehension and Generation [31.605599298507293]
We present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text.
We propose Slide-LoRA, which aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space.
Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach.
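As a loose sketch of what aggregating modality-specific and modality-agnostic LoRA experts could look like, the layer below adds a shared low-rank adapter and a gated mix of per-modality adapters on top of a frozen base projection. The dimensions, gating scheme, and class names are assumptions made for illustration, not TextHarmony's actual Slide-LoRA design.

```python
import torch
import torch.nn as nn

class LoRAExpert(nn.Module):
    """A single low-rank adapter (down-projection followed by up-projection)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # standard LoRA init: adapter starts as a no-op

    def forward(self, x):
        return self.up(self.down(x))

class SlideLoRALinear(nn.Module):
    """Frozen base projection + modality-agnostic expert + gated modality-specific experts."""
    def __init__(self, dim, modalities=("text", "image")):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)        # base weights stay frozen
        self.shared = LoRAExpert(dim)                  # modality-agnostic expert
        self.experts = nn.ModuleDict({m: LoRAExpert(dim) for m in modalities})
        self.gate = nn.Linear(dim, len(modalities))    # soft selection over specific experts

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)
        specific = sum(
            weights[..., i:i + 1] * expert(x)
            for i, expert in enumerate(self.experts.values())
        )
        return self.base(x) + self.shared(x) + specific

layer = SlideLoRALinear(dim=16)
print(layer(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```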
arXiv Detail & Related papers (2024-07-23T10:11:56Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference.
Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation on the MSCOCO dataset with common entities shows that proprietary models such as GPT-4o and Gemini-Pro significantly improve their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.
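A minimal sketch of this kind of prompt augmentation for an image-captioning query, assuming a stand-in retriever and a plain-text prompt format; the corpus, file names, and helper names are hypothetical. A real UniRAG setup would rank candidates with a multimodal retriever such as a UniIR model and send the assembled prompt to an MLLM like GPT-4o or Gemini-Pro.

```python
# Sketch: retrieved image-caption pairs become few-shot examples in the prompt.

def retrieve_examples(query: str, k: int = 2):
    """Stand-in retriever; a real system would rank by multimodal similarity to the query."""
    corpus = [
        ("img_001.jpg", "A red double-decker bus on a London street."),
        ("img_042.jpg", "A small dog catching a frisbee in a park."),
        ("img_117.jpg", "A bowl of ramen topped with a soft-boiled egg."),
    ]
    return corpus[:k]

def build_prompt(query_image: str, examples) -> str:
    shots = "\n".join(f"Image: {img}\nCaption: {cap}" for img, cap in examples)
    return f"{shots}\nImage: {query_image}\nCaption:"

prompt = build_prompt("query_image.jpg", retrieve_examples("caption this image"))
print(prompt)  # this prompt would then be sent to the multimodal LLM
```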
arXiv Detail & Related papers (2024-05-16T17:58:45Z)
- EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset [20.445453185198186]
We propose a Multimodal Data Construction Framework (MDCF) to alleviate the significant human and resource expenditure in data collection.
MDCF automatically generates explanations for a given image and its corresponding dialogue, offering a degree of interpretability.
Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses.
arXiv Detail & Related papers (2023-10-17T03:28:29Z)
- JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield a desired answer.
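A hedged sketch of the retrieve-then-answer flow, with a stand-in encoder in place of the pre-trained multimodal model and the final LLM call left as an assembled prompt string; the toy corpus and identifiers are made up for the example.

```python
# Sketch: embed the video, retrieve nearby captions from a text corpus,
# then hand question + captions to a frozen LLM as a single prompt.
import numpy as np

def embed(text_or_video: str) -> np.ndarray:
    """Stand-in for a CLIP-style encoder shared by video frames and text."""
    rng = np.random.default_rng(abs(hash(text_or_video)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

corpus = [
    "a man is skateboarding down a rail",
    "a chef is chopping onions in a kitchen",
    "two dogs are playing tug of war",
]
corpus_vecs = np.stack([embed(t) for t in corpus])

def answer(video_id: str, question: str, k: int = 2) -> str:
    sims = corpus_vecs @ embed(video_id)          # cosine similarity (unit norms)
    retrieved = [corpus[i] for i in np.argsort(-sims)[:k]]
    prompt = f"Context: {' '.join(retrieved)}\nQuestion: {question}\nAnswer:"
    return prompt  # a frozen LLM would complete this prompt

print(answer("video_0042", "What is the person doing?"))
```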
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- MuMUR: Multilingual Multimodal Universal Retrieval [19.242056928318913]
We propose MuMUR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual visual-text pairs.
We then use this data to learn a joint vision-text representation where English and non-English text queries are represented in a common embedding space.
arXiv Detail & Related papers (2022-08-24T13:55:15Z)
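A small illustrative sketch of the pseudo-pair construction step, assuming a placeholder translate() function in place of the machine translation models the paper uses; the visual identifiers, captions, and language codes are made up for the example.

```python
# Sketch: machine-translate English captions to build pseudo ground-truth
# multilingual visual-text pairs. translate() is a stand-in for a real MT model.

def translate(text: str, target_lang: str) -> str:
    """Placeholder MT model; a real pipeline would call an off-the-shelf NMT system."""
    return f"[{target_lang}] {text}"

english_pairs = [
    ("vid_001", "a woman is slicing a tomato"),
    ("vid_002", "a band performs on an outdoor stage"),
]

languages = ["fr", "de", "hi"]
multilingual_pairs = [
    (visual_id, translate(caption, lang))
    for visual_id, caption in english_pairs
    for lang in languages
]

for pair in multilingual_pairs[:3]:
    print(pair)
```

These pseudo ground-truth pairs are then used to train a joint vision-text representation so that English and non-English queries land in a common embedding space.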