MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering
over Text, Tables and Images
- URL: http://arxiv.org/abs/2309.04790v1
- Date: Sat, 9 Sep 2023 13:35:01 GMT
- Title: MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering
over Text, Tables and Images
- Authors: Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao and
Kang Liu
- Abstract summary: In-context learning has become the most popular way to solve QA problems.
We propose the MMHQA-ICL framework to address this problem.
We are the first to use an end-to-end LLM prompting method for this task.
- Score: 24.17147521556083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the real world, knowledge often exists in a multimodal and heterogeneous
form. Question answering over hybrid data types, including text, tables, and
images, is a challenging task (MMHQA). Recently, with the rise of large
language models (LLMs), in-context learning (ICL) has become the most popular
way to solve QA problems. We propose the MMHQA-ICL framework to address this
problem; it includes a stronger heterogeneous data retriever and an image
captioning module. Most importantly, we propose a Type-specific In-context
Learning Strategy for MMHQA, enabling LLMs to leverage their strong
capabilities on this task. We are the first to use an end-to-end LLM prompting
method for this task. Experimental results demonstrate that our framework
outperforms all baselines and methods trained on the full dataset, achieving
state-of-the-art results under the few-shot setting on the MultimodalQA
dataset.
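The abstract outlines the framework's moving parts (a heterogeneous data retriever, an image captioning module, and type-specific in-context prompting) without implementation detail. The following is a minimal sketch, assuming a setup in which captions stand in for images and few-shot exemplars are selected per question type; every name in it (EXEMPLARS, classify_question, build_prompt, and the retrieve / caption_images / llm callables) is a hypothetical placeholder rather than the paper's actual code.

```python
# Minimal sketch of a type-specific in-context prompting pipeline in the spirit
# of the abstract. All names below are hypothetical placeholders; the paper's
# actual components are not specified in its abstract.
from typing import Callable, Dict, List

# Hypothetical few-shot exemplar pools, one per question type, so the prompt
# shown to the LLM matches the modality mix the question needs.
EXEMPLARS: Dict[str, List[str]] = {
    "text": ["Q: ... (text passage) ... A: ..."],
    "table": ["Q: ... (linearized table) ... A: ..."],
    "image": ["Q: ... (image caption) ... A: ..."],
    "compose": ["Q: ... (text + table + caption) ... A: ..."],
}

def classify_question(question: str) -> str:
    """Placeholder question-type classifier (keyword heuristic for the sketch)."""
    q = question.lower()
    if "how many rows" in q or "column" in q:
        return "table"
    if "picture" in q or "photo" in q or "logo" in q:
        return "image"
    return "compose"

def build_prompt(question: str,
                 passages: List[str],
                 table_rows: List[str],
                 captions: List[str]) -> str:
    """Assemble a type-specific few-shot prompt: demonstrations for the predicted
    question type, then the retrieved evidence, then the question itself."""
    qtype = classify_question(question)
    parts = EXEMPLARS[qtype][:]                          # type-specific demonstrations
    parts += ["Passage: " + p for p in passages]         # retrieved text
    parts += ["Table: " + r for r in table_rows]         # linearized table evidence
    parts += ["Image caption: " + c for c in captions]   # captions stand in for images
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

def answer(question: str,
           retrieve: Callable[[str], Dict[str, List[str]]],
           caption_images: Callable[[List[str]], List[str]],
           llm: Callable[[str], str]) -> str:
    """End-to-end prompting: retrieve heterogeneous evidence, caption images,
    build a type-specific prompt, and query the LLM once."""
    evidence = retrieve(question)  # e.g. {"text": [...], "table": [...], "image": [...]}
    captions = caption_images(evidence.get("image", []))
    prompt = build_prompt(question,
                          evidence.get("text", []),
                          evidence.get("table", []),
                          captions)
    return llm(prompt)
```

Passing the retriever, captioner, and LLM client in as callables keeps the sketch independent of any particular retrieval model or LLM API.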
Related papers
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
- Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely [8.507599833330346]
Large language models (LLMs) augmented with external data have demonstrated remarkable capabilities in completing real-world tasks.
Retrieval-Augmented Generation (RAG) and fine-tuning are gaining increasing attention and widespread application.
However, the effective deployment of data-augmented LLMs across various specialized fields presents substantial challenges.
arXiv Detail & Related papers (2024-09-23T11:20:20Z)
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data [29.07028542633284]
Table-to-Text Generation is a promising solution, as it facilitates the transformation of hybrid data into a uniformly text-formatted corpus.
There is currently no comparative analysis on how corpora generated by different table-to-text methods affect the performance of QA systems.
In this paper, we innovatively integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data.
arXiv Detail & Related papers (2024-02-20T10:00:58Z)
- Can MLLMs Perform Text-to-Image In-Context Learning? [11.303734988815016]
Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored.
We benchmark six state-of-the-art Multimodal Large Language Models (MLLMs).
We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties.
arXiv Detail & Related papers (2024-02-02T10:30:05Z)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
- An In-Context Schema Understanding Method for Knowledge Base Question Answering [70.87993081445127]
Large Language Models (LLMs) have shown strong capabilities in language understanding and can be used to solve this task.
Existing methods bypass this challenge by initially employing LLMs to generate drafts of logic forms without schema-specific details.
We propose a simple In-Context Schema Understanding (ICSU) method that enables LLMs to directly understand schemas by leveraging in-context learning.
arXiv Detail & Related papers (2023-10-22T04:19:17Z)
- LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models [21.95962189710859]
We propose a lightweight, end-to-end framework to execute the Spoken Question Answering (SQA) task on the LibriSQA dataset.
By reformulating ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks.
Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs.
arXiv Detail & Related papers (2023-08-20T23:47:23Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks (a generic entropy-based acquisition sketch is given after this list).
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
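The last entry above describes entropy-based sample acquisition for active learning in VQA only at a high level. As a rough illustration of the general technique (not that paper's actual single-modal criterion), the sketch below assumes the model exposes a per-sample probability distribution over candidate answers and simply sends the highest-entropy samples for labeling first.

```python
# Generic entropy-based acquisition step for active learning (illustrative only).
import numpy as np

def entropy_scores(answer_probs: np.ndarray) -> np.ndarray:
    """Predictive entropy per unlabeled sample; answer_probs has shape
    (num_samples, num_answers) and each row sums to 1."""
    eps = 1e-12  # avoid log(0)
    return -np.sum(answer_probs * np.log(answer_probs + eps), axis=1)

def select_for_labeling(answer_probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` most uncertain samples (highest entropy) to label next."""
    scores = entropy_scores(answer_probs)
    return np.argsort(scores)[::-1][:budget]

# Example: 4 unlabeled VQA samples with 3 candidate answers each.
probs = np.array([
    [0.90, 0.05, 0.05],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> high entropy
    [0.60, 0.30, 0.10],
    [0.50, 0.45, 0.05],
])
print(select_for_labeling(probs, budget=2))  # indices of the 2 most uncertain samples
```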