UMIE: Unified Multimodal Information Extraction with Instruction Tuning
- URL: http://arxiv.org/abs/2401.03082v1
- Date: Fri, 5 Jan 2024 22:52:15 GMT
- Title: UMIE: Unified Multimodal Information Extraction with Instruction Tuning
- Authors: Lin Sun, Kai Zhang, Qingyuan Li, Renze Lou
- Abstract summary: We propose UMIE, a unified multimodal information extractor that unifies three MIE tasks as a generation problem via instruction tuning.
Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks.
Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain.
- Score: 12.777967562175437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal information extraction (MIE) has gained significant attention as
multimedia content becomes increasingly popular. However, current MIE methods
often resort to task-specific model structures, which limits generalizability
across tasks and underutilizes knowledge shared among MIE tasks. To address
these issues, we propose UMIE, a unified multimodal information extractor that
casts three MIE tasks as a single generation problem via instruction tuning and
can effectively extract both textual and visual mentions. Extensive experiments
show that our single UMIE outperforms various state-of-the-art (SoTA) methods
across six MIE datasets on three tasks.
Furthermore, in-depth analysis demonstrates UMIE's strong generalization in the
zero-shot setting, robustness to instruction variants, and interpretability.
Our research serves as an initial step towards a unified MIE model and
initiates exploration of both instruction tuning and large language models
within the MIE domain. Our code, data, and model are available at
https://github.com/ZUCC-AI/UMIE
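The abstract frames the three MIE tasks (such as multimodal named entity recognition, relation extraction, and event extraction) as a single text-generation problem steered by task instructions. The sketch below illustrates that general idea in plain Python; the instruction wording, label sets, and output format are illustrative assumptions rather than UMIE's actual templates or code (see the linked repository for those), and the paired image is only noted here, whereas the real model would encode it with a visual encoder.

```python
# Minimal sketch of casting multimodal IE as instruction-following generation.
# The instruction text, label sets, and "mention (LABEL); ..." output format
# below are illustrative assumptions, not the templates used by UMIE.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MIEExample:
    text: str        # caption or post text accompanying an image
    image_path: str  # the paired image (fed to a visual encoder in practice)
    task: str        # e.g. "MNER", "MRE", "MEE"


def build_instruction(example: MIEExample) -> str:
    """Wrap one example in a task instruction so a seq2seq model can emit
    the extraction result as plain text."""
    if example.task == "MNER":
        instruction = (
            "Extract all named entities from the text and label each with "
            "one of [PER, ORG, LOC, MISC]. "
            "Answer as: entity1 (TYPE); entity2 (TYPE)."
        )
    elif example.task == "MRE":
        instruction = (
            "Identify the relation between the two marked entities in the "
            "text. Answer with a single relation label."
        )
    else:  # multimodal event extraction
        instruction = (
            "Extract the event trigger and its arguments from the text. "
            "Answer as: trigger (EVENT_TYPE); argument (ROLE)."
        )
    return f"Instruction: {instruction}\nText: {example.text}"


def parse_generation(output: str) -> List[Tuple[str, str]]:
    """Parse 'mention (LABEL); ...' generations back into (mention, label) pairs."""
    pairs = []
    for chunk in output.split(";"):
        chunk = chunk.strip()
        if chunk.endswith(")") and "(" in chunk:
            mention, label = chunk.rsplit("(", 1)
            pairs.append((mention.strip(), label.rstrip(")").strip()))
    return pairs


if __name__ == "__main__":
    ex = MIEExample(
        text="Kevin Durant enters Oracle Arena",
        image_path="example.jpg",
        task="MNER",
    )
    print(build_instruction(ex))
    # A generation such as "Kevin Durant (PER); Oracle Arena (LOC)" parses to:
    print(parse_generation("Kevin Durant (PER); Oracle Arena (LOC)"))
```

Because every task shares the same text-in, text-out interface, a single generative model can be trained on all of them at once, which is what allows knowledge to be shared across MIE tasks.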
Related papers
- RUIE: Retrieval-based Unified Information Extraction using Large Language Model [6.788855739199981]
Unified information extraction aims to complete all information extraction tasks using a single model or framework.
We propose RUIE (Retrieval-based Unified Information Extraction), a framework that leverages in-context learning to enable rapid generalization.
Experimental results on 8 held-out datasets demonstrate RUIE's effectiveness in generalizing to unseen tasks.
arXiv Detail & Related papers (2024-09-18T03:20:04Z)
- UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model [11.885204227946549]
We propose a comprehensive model designed to represent various tasks using a unified representation.
Our model exhibits strong capabilities in comprehending the implicit intent of user instructions.
Our approach exhibits exceptional scalability and generality.
arXiv Detail & Related papers (2024-08-05T14:27:39Z) - HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z) - Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and identify emerging trends in IE with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - Multimodal Question Answering for Unified Information Extraction [15.798187192290746]
Multimodal information extraction aims to extract structured information from unstructured multimedia content.
Most current MIE models are task-specific and data-intensive.
We propose a novel multimodal question answering (MQA) framework to unify three MIE tasks.
arXiv Detail & Related papers (2023-10-04T17:58:05Z)
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding [93.92313947913831]
We introduce UniDoc, a novel multimodal model equipped with text detection and recognition capabilities.
To the best of our knowledge, this is the first large multimodal model capable of simultaneous text detection, recognition, spotting, and understanding.
arXiv Detail & Related papers (2023-08-19T17:32:34Z)
- Universal Information Extraction with Meta-Pretrained Self-Retrieval [39.69130086395689]
Universal Information Extraction (Universal IE) aims to solve different extraction tasks in a uniform text-to-structure generation manner.
Retrieving knowledge from external knowledge bases may help models overcome this problem, but it is impossible to construct a knowledge base suitable for various IE tasks.
We propose MetaRetriever to retrieve task-specific knowledge from PLMs to enhance universal IE.
arXiv Detail & Related papers (2023-06-18T00:16:00Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization [113.72253589338472]
The many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language from document inputs in any language together with the corresponding image sequence.
We propose a dual knowledge distillation and target-oriented vision modeling framework for the M$3$S task.
arXiv Detail & Related papers (2023-05-22T06:47:35Z)