MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
- URL: http://arxiv.org/abs/2408.08521v2
- Date: Fri, 07 Feb 2025 00:09:43 GMT
- Title: MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering
- Authors: Zhengyuan Zhu, Daniel Lee, Hong Zhang, Sai Sree Harsha, Loic Feujio, Akash Maharaj, Yunyao Li
- Abstract summary: We introduce a framework named MuRAR (Multimodal Retrieval and Answer Refinement).
MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers.
Human evaluation results indicate that multimodal answers generated by MuRAR are more useful and readable compared to plain text answers.
- Abstract: Recent advancements in retrieval-augmented generation (RAG) have demonstrated impressive performance in the question-answering (QA) task. However, most previous works predominantly focus on text-based answers. While some studies address multimodal data, they still fall short in generating comprehensive multimodal answers, particularly for explaining concepts or providing step-by-step tutorials on how to accomplish specific goals. This capability is especially valuable for applications such as enterprise chatbots and settings such as customer service and educational systems, where the answers are sourced from multimodal data. In this paper, we introduce a simple and effective framework named MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR enhances text-based answers by retrieving relevant multimodal data and refining the responses to create coherent multimodal answers. This framework can be easily extended to support multimodal answers in enterprise chatbots with minimal modifications. Human evaluation results indicate that multimodal answers generated by MuRAR are more useful and readable compared to plain text answers.
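The pipeline described above has three stages: generate a text answer, retrieve multimodal data relevant to each part of it, and refine the response so the retrieved items are placed coherently. The following minimal Python sketch illustrates that retrieve-then-refine flow; the token-overlap retriever, the data structures, and the splicing rule are illustrative assumptions, not MuRAR's actual components.

```python
from dataclasses import dataclass

@dataclass
class MultimodalItem:
    kind: str     # "image", "video", "table", ...
    caption: str  # textual surrogate used for retrieval
    uri: str

def embed(text: str) -> set[str]:
    # Toy "embedding": a bag of lowercased tokens. A real system would
    # use a dense text encoder here.
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap as a stand-in for cosine similarity.
    return len(a & b) / max(len(a | b), 1)

def refine_answer(question: str, text_answer: str,
                  corpus: list[MultimodalItem],
                  threshold: float = 0.2) -> str:
    """For each sentence of the text answer, retrieve the best-matching
    multimodal item and splice a reference to it into the response."""
    out = []
    for sentence in text_answer.split(". "):
        out.append(sentence)
        query = embed(question + " " + sentence)
        best = max(corpus, key=lambda item: similarity(query, embed(item.caption)))
        if similarity(query, embed(best.caption)) >= threshold:
            out.append(f"[{best.kind}: {best.uri}]")
    return "\n".join(out)

corpus = [
    MultimodalItem("image", "screenshot of the export settings dialog", "img/export.png"),
    MultimodalItem("video", "tutorial on sharing a project", "vid/share.mp4"),
]
print(refine_answer("How do I export?",
                    "Open the export settings dialog. Choose a format", corpus))
```

Only sentences that match a multimodal item above the threshold get a reference, which keeps irrelevant attachments out of the refined answer.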
Related papers
- Multi-Turn Multi-Modal Question Clarification for Enhanced Conversational Understanding
We introduce the Multi-turn Multi-modal Clarifying Questions (MMCQ) task.
MMCQ combines text and visual modalities to refine user queries in a multi-turn conversation.
We show that multi-turn multi-modal clarification outperforms uni-modal and single-turn approaches, improving MRR by 12.88%.
arXiv Detail & Related papers (2025-02-17T04:58:14Z)
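The 12.88% gain above is in Mean Reciprocal Rank. For reference, here is the standard MRR computation (the textbook definition, not code from the MMCQ paper):

```python
def mean_reciprocal_rank(ranked_results: list[list[str]],
                         relevant: list[set[str]]) -> float:
    """Average over queries of 1/rank of the first relevant result,
    counting 0 for queries where nothing relevant is retrieved."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# First relevant hit at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}]))
```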
- UniCoRN: Unified Commented Retrieval Network with LMMs
We introduce UniCoRN, a Unified Commented Retrieval Network that combines composed multimodal retrieval methods and generative language approaches.
We show improvements of +4.5% recall over the state of the art for composed multimodal retrieval and of +14.9% METEOR / +18.4% BEM over RAG for commenting in CoR.
arXiv Detail & Related papers (2025-02-12T09:49:43Z)
- MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval
We introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment.
Our method allows the textual query to attend to visual embeddings while not feeding text-driven signals back into the visual representations.
Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries.
arXiv Detail & Related papers (2024-11-13T04:32:58Z)
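MIRe's summary above describes one-directional interaction: textual queries read from visual embeddings, while the visual representations receive no text-driven signal. A minimal numpy sketch of such one-way cross-attention follows; the dimensions and token counts are illustrative, not the paper's architecture.

```python
import numpy as np

def one_way_cross_attention(text_q: np.ndarray, visual_kv: np.ndarray) -> np.ndarray:
    """Text tokens (queries) attend over visual embeddings (keys/values).
    The visual side is only read, never updated, so no text-driven signal
    flows back into the visual representations."""
    d = text_q.shape[-1]
    scores = text_q @ visual_kv.T / np.sqrt(d)        # (T_text, T_visual)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over visual tokens
    return weights @ visual_kv                        # text enriched with visual info

text_tokens = np.random.randn(4, 64)    # 4 query tokens, dim 64
visual_tokens = np.random.randn(9, 64)  # e.g., 9 image patches
print(one_way_cross_attention(text_tokens, visual_tokens).shape)  # (4, 64)
```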
- Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs).
We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
- AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit
In question answering (QA), different questions can be effectively addressed with different answering strategies.
We develop a dynamic method that adaptively selects the most suitable QA strategy for each question.
Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules.
arXiv Detail & Related papers (2024-09-20T12:28:18Z)
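AQA frames strategy selection as a contextual multi-armed bandit. The paper's exact algorithm is not reproduced here; the sketch below shows the select/update loop of a generic epsilon-greedy contextual bandit, with hypothetical strategy names.

```python
import random
from collections import defaultdict

class StrategySelector:
    """Epsilon-greedy contextual bandit: per context (e.g., a coarse
    question type), track a running mean reward for each QA strategy and
    usually pick the best one, exploring at rate epsilon."""
    def __init__(self, strategies: list[str], epsilon: float = 0.1):
        self.strategies = strategies
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: defaultdict(int))
        self.values = defaultdict(lambda: defaultdict(float))

    def select(self, context: str) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.strategies)  # explore
        vals = self.values[context]
        return max(self.strategies, key=lambda s: vals[s])  # exploit

    def update(self, context: str, strategy: str, reward: float) -> None:
        self.counts[context][strategy] += 1
        n = self.counts[context][strategy]
        v = self.values[context][strategy]
        self.values[context][strategy] = v + (reward - v) / n  # running mean

selector = StrategySelector(["direct", "rag", "multi_hop"])
choice = selector.select(context="factoid")
selector.update("factoid", choice, reward=1.0)  # e.g., answer judged correct
```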
- An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models
We present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index.
One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities.
The system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques.
arXiv Detail & Related papers (2024-07-05T02:01:49Z)
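The MQA summary above mentions contrastive learning for assessing modality significance. As a loose, hypothetical illustration (not the system's actual mechanism), alignment scores between a query and each modality's embedding, as produced by a contrastively trained encoder, can be turned into modality weights with a temperature softmax:

```python
import numpy as np

def modality_weights(query_emb: np.ndarray, modality_embs: list[np.ndarray],
                     temperature: float = 0.07) -> np.ndarray:
    """Cosine-score each modality against the query, then softmax the
    scores into weights; higher weight = more significant modality."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = np.array([m @ q / np.linalg.norm(m) for m in modality_embs])
    logits = sims / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

query = np.random.randn(32)
mods = [np.random.randn(32) for _ in range(3)]  # e.g., text, image, video
print(modality_weights(query, mods))            # weights summing to 1
```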
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
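DialCLIP's "multiple experts" map CLIP outputs into a multi-modal representation space. One plausible reading, sketched below, is a softly gated mixture of linear expert heads; the gating scheme and all dimensions are assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CLIP, D_RETR, N_EXPERTS = 512, 256, 3

# Hypothetical expert heads, each mapping the CLIP output to retrieval space.
experts = [rng.standard_normal((D_CLIP, D_RETR)) * 0.02 for _ in range(N_EXPERTS)]
gate = rng.standard_normal((D_CLIP, N_EXPERTS)) * 0.02

def map_to_retrieval_space(clip_out: np.ndarray) -> np.ndarray:
    """Gate over the experts, then combine their projections."""
    logits = clip_out @ gate
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(w[i] * (clip_out @ experts[i]) for i in range(N_EXPERTS))

print(map_to_retrieval_space(rng.standard_normal(D_CLIP)).shape)  # (256,)
```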
- Multimodal Multi-Hop Question Answering Through a Conversation Between Tools and Efficiently Finetuned Large Language Models
We employ a tool-interacting divide-and-conquer strategy to answer complex multi-hop questions.
To increase the reasoning ability of LLMs, we prompt ChatGPT to generate a tool-interacting divide-and-conquer dataset.
To assess the effectiveness of this approach, we conduct an evaluation on two recently introduced complex question-answering datasets.
arXiv Detail & Related papers (2023-09-16T08:22:22Z)
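The divide-and-conquer strategy above splits a multi-hop question into sub-questions, routes each to a tool, and threads earlier answers into later hops. A toy sketch of that control flow; the tool names and the slot-filling convention are hypothetical:

```python
# Stub tools; a real system would wrap retrievers, calculators,
# table readers, image captioners, and so on.
def table_tool(q: str) -> str: return f"<table answer to: {q}>"
def image_tool(q: str) -> str: return f"<image answer to: {q}>"
def text_tool(q: str) -> str:  return f"<text answer to: {q}>"

TOOLS = {"table": table_tool, "image": image_tool, "text": text_tool}

def answer_multi_hop(sub_questions: list[tuple[str, str]]) -> str:
    """Each (tool, sub-question) hop may reference earlier answers via
    {0}, {1}, ... placeholders; the final hop's answer is returned."""
    answers = []
    for tool_name, sub_q in sub_questions:
        grounded_q = sub_q.format(*answers)  # fill slots with earlier answers
        answers.append(TOOLS[tool_name](grounded_q))
    return answers[-1]

# Hop 1 finds an entity in a table; hop 2 asks an image question about it.
print(answer_multi_hop([
    ("table", "Which team won in 2010?"),
    ("image", "What color is the logo of {0}?"),
]))
```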
- Single-Modal Entropy based Active Learning for Visual Question Answering
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
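The entry above builds on entropy-based acquisition: prioritize labeling the samples whose predictions the model is least certain about. A minimal sketch of plain entropy-based selection (the paper scores entropy per single modality, which this sketch does not reproduce):

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predictive distribution; higher entropy means
    the model is less certain, so the sample is more informative to label."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(unlabeled_preds: dict[str, list[float]], k: int) -> list[str]:
    """Pick the k unlabeled samples with the highest-entropy predictions."""
    ranked = sorted(unlabeled_preds,
                    key=lambda s: entropy(unlabeled_preds[s]), reverse=True)
    return ranked[:k]

preds = {
    "img_001": [0.95, 0.03, 0.02],  # confident -> low entropy
    "img_002": [0.40, 0.35, 0.25],  # uncertain -> high entropy
    "img_003": [0.70, 0.20, 0.10],
}
print(select_for_labeling(preds, k=2))  # ['img_002', 'img_003']
```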
- MultiModalQA: Complex Question Answering over Text, Tables and Images
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)