mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large
Language Model
- URL: http://arxiv.org/abs/2311.18248v2
- Date: Tue, 9 Jan 2024 12:07:02 GMT
- Title: mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large
Language Model
- Authors: Anwen Hu, Yaya Shi, Haiyang Xu, Jiabo Ye, Qinghao Ye, Ming Yan,
Chenliang Li, Qi Qian, Ji Zhang, Fei Huang
- Abstract summary: This work focuses on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.
By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper.
M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code.
- Score: 73.38800189095173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the strong text creation ability of Large Language Models (LLMs) has
given rise to many tools for assisting paper reading or even writing. However,
the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit
their application scenarios, especially for scientific academic paper writing.
In this work, towards a more versatile copilot for academic paper writing, we
mainly focus on strengthening the multi-modal diagram analysis ability of
Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we
carefully build a multi-modal diagram understanding dataset M-Paper. By
aligning diagrams in the paper with related paragraphs, we construct
professional diagram analysis samples for training and evaluation. M-Paper is
the first dataset to support joint comprehension of multiple scientific
diagrams, including figures and tables in the format of images or LaTeX code.
Besides, to better align the copilot with the user's intention, we introduce
the `outline' as the control signal, which could be directly given by the user
or revised based on auto-generated ones. Comprehensive experiments with a
state-of-the-art Multimodal LLM demonstrate that training on our dataset yields
stronger scientific diagram understanding performance, including diagram
captioning, diagram analysis, and outline recommendation. The dataset, code,
and model are available at
https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.
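Below is a minimal sketch of the kind of construction the abstract describes: parsing a paper's LaTeX source, pulling out figure/table environments, and aligning each with the paragraphs that reference it. It is an illustration only, not the authors' M-Paper pipeline; the file path, regexes, and helper names (extract_diagrams, align_with_paragraphs) are assumptions made for this example.

```python
import re
from pathlib import Path

# Sketch only (not the authors' M-Paper pipeline): pull figure/table
# environments out of a LaTeX source file and pair each with the
# paragraphs that cite it via \ref{...}.
FIG_RE = re.compile(r"\\begin\{(figure\*?|table\*?)\}(.*?)\\end\{\1\}", re.S)
LABEL_RE = re.compile(r"\\label\{([^}]+)\}")
CAPTION_RE = re.compile(r"\\caption\{([^}]*)\}")

def extract_diagrams(tex: str):
    """Return one record per figure/table environment found in the source."""
    records = []
    for match in FIG_RE.finditer(tex):
        body = match.group(2)
        label = LABEL_RE.search(body)
        caption = CAPTION_RE.search(body)
        records.append({
            "env": match.group(1),
            "label": label.group(1) if label else None,
            "caption": caption.group(1) if caption else "",
            "latex": body.strip(),
        })
    return records

def align_with_paragraphs(tex: str, records):
    """Attach every paragraph that references a diagram as its analysis text."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", tex) if p.strip()]
    for rec in records:
        if not rec["label"]:
            rec["analysis"] = []
            continue
        ref = "\\ref{" + rec["label"] + "}"
        rec["analysis"] = [p for p in paragraphs if ref in p]
    return records

if __name__ == "__main__":
    source = Path("paper/main.tex").read_text()  # assumed file layout
    samples = align_with_paragraphs(source, extract_diagrams(source))
    for s in samples:
        print(s["env"], s["label"], len(s["analysis"]), "aligned paragraph(s)")
```

A full pipeline along these lines would additionally render the figures to images and derive the 'outline' control signal described in the abstract (for example, a short bullet summary of the aligned analysis paragraphs, either user-given or auto-generated).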
Related papers
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient, and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs (a toy sketch of this code-as-intermediary idea appears at the end of the related-papers list).
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Large Language Models on Graphs: A Comprehensive Survey [77.16803297418201]
We provide a systematic review of scenarios and techniques related to large language models on graphs.
We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-attributed graphs, and text-paired graphs.
We discuss the real-world applications of such methods and summarize open-source codes and benchmark datasets.
arXiv Detail & Related papers (2023-12-05T14:14:27Z)
- Integrating Graphs with Large Language Models: Methods and Prospects [68.37584693537555]
Large language models (LLMs) have emerged as frontrunners, showcasing unparalleled prowess in diverse applications.
Merging the capabilities of LLMs with graph-structured data has been a topic of keen interest.
This paper bifurcates such integrations into two predominant categories.
arXiv Detail & Related papers (2023-10-09T07:59:34Z)
- Beyond Text: A Deep Dive into Large Language Models' Ability on Understanding Graph Data [13.524529952170672]
Large language models (LLMs) have achieved impressive performance on many natural language processing tasks.
We aim to assess whether LLMs can effectively process graph data and leverage topological structures to enhance performance.
By comparing LLMs' performance with specialized graph models, we offer insights into the strengths and limitations of employing LLMs for graph analytics.
arXiv Detail & Related papers (2023-10-07T23:25:22Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
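As referenced in the Code-as-Intermediary Translation (CIT) entry above, the following toy sketch illustrates the general idea of using chart-plotting code as a textual intermediary: the plotting code is the artifact an LLM can read and reason over, executing it renders the chart image, and Q&A pairs are derived from the same underlying data so that answers are grounded by construction. This is not the CIT or ReachQA pipeline; the data values, file name, and Q&A pair are made up for illustration, and the plotting code is hand-written here rather than model-synthesized.

```python
import matplotlib
matplotlib.use("Agg")               # headless rendering
import matplotlib.pyplot as plt

# Toy illustration (not the CIT/ReachQA pipeline): synthetic data, a small
# chart-plotting snippet, and a Q&A pair grounded in the same data.
data = {"2021": 12, "2022": 19, "2023": 31}     # assumed synthetic values

plot_code = """
fig, ax = plt.subplots()
ax.bar(list(data.keys()), list(data.values()))
ax.set_title("Papers per year")
ax.set_ylabel("count")
fig.savefig("chart.png")
"""

# Execute the plotting code to render the chart image it describes.
exec(plot_code, {"plt": plt, "data": data})

qa_pair = {
    "image": "chart.png",
    "question": "Between which consecutive years did the count grow the most?",
    "answer": "2022 to 2023",                   # 31 - 19 = 12, the largest jump
}
print(qa_pair)
```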
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.