Multimodal Question Answering for Unified Information Extraction
- URL: http://arxiv.org/abs/2310.03017v1
- Date: Wed, 4 Oct 2023 17:58:05 GMT
- Title: Multimodal Question Answering for Unified Information Extraction
- Authors: Yuxuan Sun, Kai Zhang, Yu Su
- Abstract summary: Multimodal information extraction aims to extract structured information from unstructured multimedia content.
Most current MIE models are task-specific and data-intensive.
We propose a novel multimodal question answering (MQA) framework to unify three MIE tasks.
- Score: 15.798187192290746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal information extraction (MIE) aims to extract structured
information from unstructured multimedia content. Due to the diversity of tasks
and settings, most current MIE models are task-specific and data-intensive,
which limits their generalization to real-world scenarios with diverse task
requirements and limited labeled data. To address these issues, we propose a
novel multimodal question answering (MQA) framework to unify three MIE tasks by
reformulating them into a unified span extraction and multi-choice QA pipeline.
Extensive experiments on six datasets show that: 1) Our MQA framework
consistently and significantly improves the performance of various
off-the-shelf large multimodal models (LMMs) on MIE tasks, compared to vanilla
prompting. 2) In the zero-shot setting, MQA outperforms previous
state-of-the-art baselines by a large margin. In addition, the effectiveness of
our framework transfers successfully to the few-shot setting, enabling LMMs at
the 10B-parameter scale to be competitive with, or even outperform, much larger
language models such as ChatGPT and GPT-4. Our MQA framework can serve as a
general principle for utilizing LMMs to better solve MIE and potentially other
downstream multimodal tasks.
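As a concrete reading of the span-extraction and multi-choice QA pipeline described in the abstract, the sketch below poses a multimodal NER query to an off-the-shelf LMM in two stages. The prompt templates, label set, and the `ask_lmm` callable are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable, List

# Minimal sketch of the two-stage MQA pipeline (assumed templates, not the paper's exact prompts):
# Stage 1 - span extraction: ask the LMM to list candidate entity spans in the text.
# Stage 2 - multi-choice QA: ask the LMM to pick a type for each extracted span.

SPAN_PROMPT = (
    "Given the image and the text below, list every entity span "
    "that appears in the text, one per line.\n\nText: {text}"
)

TYPE_PROMPT = (
    "Given the image and the text below, what is the type of the "
    "entity \"{span}\"?\n\nText: {text}\nOptions:\n{options}\n"
    "Answer with a single option letter."
)

ENTITY_TYPES = ["person", "organization", "location", "miscellaneous"]  # assumed label set


def mqa_extract(text: str, image, ask_lmm: Callable[[str, object], str]) -> List[dict]:
    """Run the two-stage pipeline with any LMM exposed as ask_lmm(prompt, image) -> str."""
    # Stage 1: span extraction.
    raw = ask_lmm(SPAN_PROMPT.format(text=text), image)
    spans = [s.strip() for s in raw.splitlines() if s.strip()]

    results = []
    for span in spans:
        # Stage 2: multi-choice QA over the label set.
        options = "\n".join(f"({chr(ord('A') + i)}) {t}" for i, t in enumerate(ENTITY_TYPES))
        answer = ask_lmm(TYPE_PROMPT.format(span=span, text=text, options=options), image)
        letter = answer.strip().lstrip("(")[:1].upper()
        idx = ord(letter) - ord("A") if letter else -1
        if 0 <= idx < len(ENTITY_TYPES):
            results.append({"span": span, "type": ENTITY_TYPES[idx]})
    return results
```

Any LMM can be plugged in by wrapping its inference call as `ask_lmm`; the framework itself only changes how the task is asked, which is what distinguishes it from vanilla prompting.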
Related papers
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z)
- Simplifying Multimodality: Unimodal Approach to Multimodal Challenges in Radiology with General-Domain Large Language Model [3.012719451477384]
We introduce MID-M, a novel framework that leverages the in-context learning capabilities of a general-domain Large Language Model (LLM) to process multimodal data via image descriptions.
MID-M achieves comparable or superior performance to task-specific fine-tuned LMMs and other general-domain models, without extensive domain-specific training or pre-training on multimodal data.
The robustness of MID-M against data quality issues demonstrates its practical utility in real-world medical domain applications.
arXiv Detail & Related papers (2024-04-29T13:23:33Z)
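The MID-M entry above describes routing images through textual descriptions so that a general-domain, text-only LLM can handle the multimodal task via in-context learning. The sketch below shows one way such a pipeline could be wired up; the captioning function and prompt template are hypothetical placeholders, and the LLM call is left abstract rather than tied to any specific API.

```python
from typing import Callable, List, Tuple

def build_icl_prompt(
    describe_image: Callable[[object], str],   # hypothetical captioner, e.g. any image-to-text model
    examples: List[Tuple[object, str, str]],   # (image, question, answer) demonstrations
    query_image: object,
    query_question: str,
) -> str:
    """Turn multimodal examples into a text-only few-shot prompt for a general-domain LLM."""
    blocks = []
    for image, question, answer in examples:
        blocks.append(
            f"Image description: {describe_image(image)}\n"
            f"Question: {question}\nAnswer: {answer}"
        )
    # The query follows the same format but leaves the answer blank for the LLM to complete.
    blocks.append(
        f"Image description: {describe_image(query_image)}\n"
        f"Question: {query_question}\nAnswer:"
    )
    return "\n\n".join(blocks)
```

Because the LLM only ever sees text, the same prompt can be sent to any general-domain model without multimodal pre-training, which is the property the entry highlights.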
- Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding [7.329728566839757]
We propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF).
MoPE-BAF is a novel multi-modal soft prompt framework based on a unified vision-language model (VLM).
arXiv Detail & Related papers (2024-03-17T19:12:26Z)
- Multimodal Instruction Tuning with Conditional Mixture of LoRA [54.65520214291653]
This paper introduces a novel approach that integrates multimodal instruction tuning with Low-Rank Adaptation (LoRA).
It innovates upon LoRA by dynamically constructing low-rank adaptation matrices tailored to the unique demands of each input instance.
Experimental results on various multimodal evaluation datasets indicate that MixLoRA outperforms conventional LoRA at the same or even higher ranks.
arXiv Detail & Related papers (2024-02-24T20:15:31Z)
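As a rough illustration of the idea in the MixLoRA entry above, the sketch below conditions a mixture of LoRA experts on a pooled representation of the input, so the effective low-rank update differs per instance. It is a generic mixture-of-LoRA layer written under assumed design choices (expert count, mean pooling, softmax gating), not the paper's exact factor-construction scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalMixtureLoRA(nn.Module):
    """Frozen base linear layer plus an input-conditioned mixture of LoRA experts (sketch)."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, n_experts: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                      # base weights stay frozen
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)   # down-projections
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))         # up-projections, zero-init
        self.router = nn.Linear(d_in, n_experts)         # assumed gating network
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x.mean(dim=1)), dim=-1)      # (batch, n_experts), one gate per instance
        down = torch.einsum("bsd,erd->bser", x, self.A)            # (batch, seq, n_experts, rank)
        up = torch.einsum("bser,eor->bseo", down, self.B)          # (batch, seq, n_experts, d_out)
        delta = torch.einsum("bseo,be->bso", up, gates)            # mix experts with per-instance gates
        return self.base(x) + self.scale * delta
```

With `d_in == d_out` the layer can stand in for a frozen projection inside a transformer block; only the LoRA factors and the router receive gradients, which keeps the parameter-efficiency of standard LoRA while letting the adaptation vary with the input.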
- UMIE: Unified Multimodal Information Extraction with Instruction Tuning [12.777967562175437]
We propose UMIE, a unified multimodal information extractor, to unify three MIE tasks as a generation problem using instruction tuning.
Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks.
Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain.
arXiv Detail & Related papers (2024-01-05T22:52:15Z)
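To make the generation-style unification in the UMIE entry concrete, the snippet below casts three MIE tasks as instruction/output text pairs for instruction tuning. The task names, instruction wording, and output format are assumptions for illustration, not the dataset's actual templates.

```python
# Assumed instruction templates casting three MIE tasks as text generation
# (named entity recognition, relation extraction, event extraction).
INSTRUCTION_TEMPLATES = {
    "entity": "Given the image and the sentence, list each entity as 'type: span'.",
    "relation": "Given the image and the sentence, state the relation between {head} and {tail}.",
    "event": "Given the image and the sentence, list the event trigger and its arguments.",
}

def format_example(task: str, sentence: str, target: str, **slots) -> dict:
    """Build one instruction-tuning example; every task shares the same generation format."""
    return {
        "instruction": INSTRUCTION_TEMPLATES[task].format(**slots),
        "input": sentence,                 # the image is passed to the model separately
        "output": target,                  # structured information rendered as plain text
    }

example = format_example(
    "entity",
    "Kevin Durant joined the Golden State Warriors.",
    "person: Kevin Durant | organization: Golden State Warriors",
)
```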
- MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.