MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering
over Text, Tables and Images
- URL: http://arxiv.org/abs/2309.04790v1
- Date: Sat, 9 Sep 2023 13:35:01 GMT
- Title: MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering
over Text, Tables and Images
- Authors: Weihao Liu, Fangyu Lei, Tongxu Luo, Jiahe Lei, Shizhu He, Jun Zhao and
Kang Liu
- Abstract summary: In-context learning has become the most popular way to solve QA problems.
We propose the MMHQA-ICL framework to address this problem.
We are the first to use an end-to-end prompting method for this task.
- Score: 24.17147521556083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the real world, knowledge often exists in a multimodal and heterogeneous
form. Question answering over hybrid data types, including text, tables,
and images (MMHQA), is a challenging task. Recently,
with the rise of large language models (LLM), in-context learning (ICL) has
become the most popular way to solve QA problems. We propose the MMHQA-ICL
framework to address this problem, which includes a stronger heterogeneous
data retriever and an image caption module. Most importantly, we propose a
Type-specific In-context Learning Strategy for MMHQA, enabling LLMs to leverage
their powerful performance in this task. We are the first to use an end-to-end LLM
prompting method for this task. Experimental results demonstrate that our
framework outperforms all baselines and methods trained on the full dataset,
achieving state-of-the-art results under the few-shot setting on the
MultimodalQA dataset.
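
The abstract names a type-specific in-context learning strategy but gives no implementation details. The following is only a minimal sketch of how such prompting could be assembled, assuming that each question has already been assigned a type and that a retriever and a captioning module have already produced passages, a table, and image captions. All names here (`Demo`, `DEMO_POOL`, `linearize_table`, `build_prompt`), the prompt layout, and the three type labels are hypothetical illustrations, not the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Demo:
    """One few-shot demonstration (hypothetical structure)."""
    context: str
    question: str
    answer: str


# Hypothetical demonstration pool, keyed by an assumed question type.
DEMO_POOL: Dict[str, List[Demo]] = {
    "text": [Demo("Passage: Paris is the capital of France.",
                  "What is the capital of France?", "Paris")],
    "table": [Demo("Table:\nCity | Population\nOslo | 700000",
                   "What is the population of Oslo?", "700000")],
    "image": [Demo("Image caption: a red double-decker bus on a street.",
                   "What color is the bus?", "red")],
}


def linearize_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Flatten a table into pipe-separated lines so it fits in a text prompt."""
    lines = [" | ".join(header)]
    lines.extend(" | ".join(str(cell) for cell in row) for row in rows)
    return "\n".join(lines)


def build_prompt(question: str, qtype: str, passages: Sequence[str] = (),
                 table: Optional[str] = None, captions: Sequence[str] = (),
                 k: int = 2) -> str:
    """Prepend up to k demonstrations of the same type, then the test instance."""
    blocks = [f"{d.context}\nQuestion: {d.question}\nAnswer: {d.answer}"
              for d in DEMO_POOL.get(qtype, [])[:k]]
    # Images enter the prompt only as captions, so a text-only LLM can read them.
    context = [f"Passage: {p}" for p in passages]
    if table is not None:
        context.append("Table:\n" + table)
    context.extend(f"Image caption: {c}" for c in captions)
    blocks.append("\n".join(context) + f"\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Usage: a table-type question receives only table-type demonstrations.
example_table = linearize_table(["Film", "Year"], [["Alien", 1979]])
print(build_prompt("In which year was Alien released?", "table", table=example_table))
```

Keying the demonstration pool by question type keeps the few-shot examples structurally close to the test instance, which is one plausible reading of "type-specific"; the actual type taxonomy and prompt format used by the framework may differ.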
Related papers
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
- TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
To better evaluate this setting and facilitate modeling efforts, we introduce TACT - Text And Calculations through Tables.
TACT contains challenging instructions that demand stitching information scattered across one or more texts, and performing complex integration on this information to generate the answer.
arXiv Detail & Related papers (2024-06-05T20:32:56Z)
- Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data [29.07028542633284]
Table-to-Text Generation is a promising solution by facilitating the transformation of hybrid data into a uniformly text-formatted corpus.
There is currently no comparative analysis on how corpora generated by different table-to-text methods affect the performance of QA systems.
In this paper, we innovatively integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data.
arXiv Detail & Related papers (2024-02-20T10:00:58Z)
- Can MLLMs Perform Text-to-Image In-Context Learning? [11.303734988815016]
The Text-to-Image ICL (T2I-ICL) with its unique characteristics and potential applications remains underexplored.
We benchmark six state-of-the-art Multimodal Large Language Models (MLLMs).
We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help to mitigate these difficulties.
arXiv Detail & Related papers (2024-02-02T10:30:05Z)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
- What Large Language Models Bring to Text-rich VQA? [38.569505870771025]
Text-rich VQA, namely Visual Question Answering based on text recognition in the images, is a cross-modal task that requires both image comprehension and text recognition.
To address the above concern, we leverage external OCR models to recognize texts in the image and Large Language Models (LLMs) to answer the question given texts.
This pipeline achieved superior performance compared to the majority of existing Multimodal Large Language Models (MLLM) on four text-rich VQA datasets.
arXiv Detail & Related papers (2023-11-13T12:52:29Z)
- An In-Context Schema Understanding Method for Knowledge Base Question Answering [70.87993081445127]
Large Language Models (LLMs) have shown strong capabilities in language understanding and can be used to solve the knowledge base question answering task.
Existing methods bypass this challenge by initially employing LLMs to generate drafts of logic forms without schema-specific details.
We propose a simple In-Context Schema Understanding (ICSU) method that enables LLMs to directly understand schemas by leveraging in-context learning.
arXiv Detail & Related papers (2023-10-22T04:19:17Z)
- Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z)
- LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models [21.95962189710859]
We propose a lightweight, end-to-end framework to execute the Spoken Question Answering (SQA) task on the LibriSQA dataset.
By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks.
Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs.
arXiv Detail & Related papers (2023-08-20T23:47:23Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, "ReViz", that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.