MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with
Large Language Model
- URL: http://arxiv.org/abs/2310.13265v1
- Date: Fri, 20 Oct 2023 04:09:36 GMT
- Title: MoqaGPT : Zero-Shot Multi-modal Open-domain Question Answering with
Large Language Model
- Authors: Le Zhang, Yihong Wu, Fengran Mo, Jian-Yun Nie, Aishwarya Agrawal
- Abstract summary: MoqaGPT is a framework for multi-modal open-domain question answering.
It retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer.
On the MultiModalQA dataset, MoqaGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods.
- Score: 33.546564412022754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal open-domain question answering typically requires evidence
retrieval from databases across diverse modalities, such as images, tables,
passages, etc. Even Large Language Models (LLMs) like GPT-4 fall short in this
task. To enable LLMs to tackle the task in a zero-shot manner, we introduce
MoqaGPT, a straightforward and flexible framework. Using a divide-and-conquer
strategy that bypasses intricate multi-modality ranking, our framework can
accommodate new modalities and seamlessly transition to new models for the
task. Built upon LLMs, MoqaGPT retrieves and extracts answers from each
modality separately, then fuses this multi-modal information using LLMs to
produce a final answer. Our methodology boosts performance on the MMCoQA
dataset, improving F1 by +37.91 points and EM by +34.07 points over the
supervised baseline. On the MultiModalQA dataset, MoqaGPT surpasses the
zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and
significantly closes the gap with supervised methods. Our codebase is available
at https://github.com/lezhang7/MOQAGPT.
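The divide-and-conquer strategy the abstract describes can be sketched as follows: answer the question against each modality independently, then hand the per-modality candidates to an LLM for fusion. This is a minimal illustration, not the authors' implementation; all function names and the stub retrievers/extractors are assumptions.

```python
# Hypothetical sketch of the MoqaGPT-style pipeline: per-modality retrieval
# and answer extraction, followed by LLM-based fusion. No cross-modality
# ranking is needed, and adding a modality just means adding a corpus entry.

def answer_per_modality(question, corpora):
    """Retrieve evidence and extract one candidate answer per modality."""
    candidates = {}
    for modality, retrieve, extract in corpora:
        evidence = retrieve(question)              # modality-specific retriever
        candidates[modality] = extract(question, evidence)
    return candidates

def fuse_with_llm(question, candidates, llm):
    """Ask an LLM to compose a final answer from the per-modality candidates."""
    listing = "\n".join(f"- [{m}] {a}" for m, a in candidates.items())
    prompt = (
        f"Question: {question}\n"
        f"Candidate answers from different modalities:\n{listing}\n"
        "Final answer:"
    )
    return llm(prompt)
```

Because each modality is handled in isolation, swapping in a stronger retriever or a new modality only changes one `corpora` entry, which is the flexibility the abstract claims.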
Related papers
- RETLLM: Training and Data-Free MLLMs for Multimodal Information Retrieval [2.2125276321198677]
Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. We introduce a novel framework, RetLLM, designed to query MLLMs for MMIR in a training- and data-free manner.
arXiv Detail & Related papers (2026-02-25T10:31:32Z)
- FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data [60.09659670497899]
Large language models (LLMs) form a diverse landscape, with each model excelling at different tasks. This diversity drives researchers to employ multiple LLMs in practice, leaving behind valuable multi-LLM log data. We argue that practical fusion must meet two essential requirements: (1) compatibility with real-world serving scenarios (e.g., local and API-based serving), and (2) flexibility to operate at different stages of the LLM pipeline to meet varied user needs.
arXiv Detail & Related papers (2025-07-14T17:58:02Z)
- Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models [44.32482918853282]
No existing work can leverage multi-source MLLM tasks for stable reinforcement learning. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). In particular, it has four different reward functions: a matching reward for binary-answer or multiple-choice problems, a chart reward for chart-aware datasets, an IoU reward for grounding problems, and an open-ended reward for long-form text responses such as caption datasets.
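The four task-specific rewards listed above can be illustrated with a small dispatcher. This is a hedged sketch, not the paper's implementation: the exact reward definitions (especially the token-overlap stand-in for the open-ended reward) are assumptions for illustration.

```python
# Illustrative sketch of a mixed reward function: dispatch each sample to a
# task-specific reward, in the spirit of the Mixed-R1 summary above.

def iou(box_a, box_b):
    """Intersection-over-union for (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def mixed_reward(task, prediction, target):
    if task in ("binary", "multiple_choice"):
        return 1.0 if prediction == target else 0.0  # matching reward
    if task == "grounding":
        return iou(prediction, target)               # IoU reward
    if task == "open_ended":
        # crude token-overlap stand-in for a learned open-ended reward
        p, t = set(prediction.split()), set(target.split())
        return len(p & t) / max(len(p | t), 1)
    raise ValueError(f"unknown task type: {task}")
```

A single scalar interface like this is what lets one RL loop train stably across heterogeneous task sources.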
arXiv Detail & Related papers (2025-05-30T03:11:46Z)
- GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable [1.79487674052027]
We propose a framework to take full advantage of existing large language foundation models (LLMs).
We train an additional branch of transformer blocks on the final-layer embedding of a pretrained LLM, which serves as the base; a carry-on module then merges with the base model to compose a customized LLM.
As the base model doesn't need to update its parameters, we are able to outsource most of the computation of the training job to inference nodes.
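A minimal sketch of the carry-on idea, assuming a frozen base whose final-layer embedding feeds a small trainable branch; a single softmax head stands in here for the extra transformer blocks, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_base_embed(token_ids, d=8):
    """Stand-in for a pretrained LLM's final-layer embedding (frozen, no grad)."""
    table = np.sin(np.outer(np.arange(100), np.arange(1, d + 1)))  # fixed weights
    return table[token_ids].mean(axis=0)

class CarryOnHead:
    """Tiny trainable branch on top of the frozen embeddings."""
    def __init__(self, d=8, n_classes=2, lr=0.1):
        self.w = rng.normal(scale=0.01, size=(d, n_classes))
        self.lr = lr

    def forward(self, h):
        z = h @ self.w
        e = np.exp(z - z.max())
        return e / e.sum()                      # softmax probabilities

    def train_step(self, h, label):
        p = self.forward(h)
        grad = np.outer(h, p - np.eye(len(p))[label])  # softmax CE gradient
        self.w -= self.lr * grad                # only the head is updated
        return -np.log(p[label])                # cross-entropy loss
```

Because only `CarryOnHead` has trainable parameters, the expensive base forward pass can run on inference-only nodes, which is the cost argument the summary makes.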
arXiv Detail & Related papers (2025-04-10T07:15:40Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z)
- TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering [48.55956886819481]
We introduce a modular multi-LMM agent framework based on several agents with different roles.
Specifically, we propose TraveLER, a method that can create a plan to "Traverse" through the video.
We find that the proposed TraveLER approach improves performance on several VideoQA benchmarks without the need to fine-tune on specific datasets.
arXiv Detail & Related papers (2024-04-01T20:58:24Z)
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the power of large language models (LLMs) to solve this task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
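The planner/caller/summarizer decomposition described above can be sketched as a simple orchestration loop. This is a minimal illustration under assumed interfaces: the lambdas in the test stand in for real (possibly small) LLM calls, and all names are hypothetical.

```python
# Sketch of a modular multi-LLM agent: a planner proposes steps, a caller
# turns each step into a concrete tool invocation, and a summarizer composes
# the final answer from the collected observations. Each role could be served
# by a different, individually updatable model.

def multi_llm_agent(task, planner, caller, summarizer, tools):
    plan = planner(task)                 # e.g. a list of step descriptions
    observations = []
    for step in plan:
        tool_name, arg = caller(task, step)   # choose a tool and its argument
        observations.append(tools[tool_name](arg))
    return summarizer(task, observations)     # compose the final answer
```

Because each role is a plain callable, swapping in a smaller model for, say, the caller requires no change to the loop itself, which is the modularity claim the summary makes.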
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
- Generative Multimodal Entity Linking [24.322540112710918]
Multimodal Entity Linking (MEL) is the task of mapping mentions with multimodal contexts to referent entities from a knowledge base.
Existing MEL methods mainly focus on designing complex multimodal interaction mechanisms and require fine-tuning all model parameters.
We propose GEMEL, a Generative Multimodal Entity Linking framework based on Large Language Models (LLMs).
Our framework is compatible with any off-the-shelf language model, paving the way towards an efficient and general solution.
arXiv Detail & Related papers (2023-06-22T07:57:19Z)
- Enhancing In-Context Learning with Answer Feedback for Multi-Span Question Answering [9.158919909909146]
In this paper, we propose a novel way of employing labeled data such that it informs the LLM of some undesired outputs.
Experiments on three multi-span question answering datasets and a keyphrase extraction dataset show that our new prompting strategy consistently improves LLM's in-context learning performance.
arXiv Detail & Related papers (2023-06-07T15:20:24Z)
- Self-Prompting Large Language Models for Zero-Shot Open-Domain QA [67.08732962244301]
Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing background documents.
This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models.
We propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of Large Language Models.
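One way to read the self-prompting idea above is that the LLM first generates its own pseudo demonstrations and then conditions on them when answering the real question. The sketch below is a hedged illustration of that reading, not the paper's pipeline; `llm` and the prompt wording are stand-ins.

```python
# Minimal self-prompting loop: generate pseudo QA demonstrations from the
# model's own parametric knowledge, then use them as in-context examples.

def self_prompt_answer(question, llm, n_demos=2):
    # Step 1: ask the model to produce its own demonstrations.
    demos = [llm("Write a short factual question and its answer.")
             for _ in range(n_demos)]
    # Step 2: prepend the generated demos as in-context examples.
    context = "\n\n".join(demos)
    return llm(f"{context}\n\nQ: {question}\nA:")
```

No training data or retrieval corpus is touched; everything the prompt contains comes out of the model itself, which is what makes the setting zero-shot.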
arXiv Detail & Related papers (2022-12-16T18:23:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.