Exploring the Capabilities of Large Multimodal Models on Dense Text
- URL: http://arxiv.org/abs/2405.06706v1
- Date: Thu, 9 May 2024 07:47:25 GMT
- Title: Exploring the Capabilities of Large Multimodal Models on Dense Text
- Authors: Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai,
- Abstract summary: We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
- Score: 58.82262549456294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large multi-modal models (LMM) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remains to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMM in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMM: prompt engineering and downstream fine-tuning. We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved. We hope that this research will promote the study of LMM in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.
Related papers
- MMSci: A Multimodal Multi-Discipline Dataset for PhD-Level Scientific Comprehension [59.41495657570397]
We collected a multimodal, multidisciplinary dataset from open-access scientific articles published in Nature Communications journals.
This dataset spans 72 scientific disciplines, ensuring both diversity and quality.
We created benchmarks with various tasks and settings to comprehensively evaluate LMMs' capabilities in understanding scientific figures and content.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation [20.106207598099363]
We introduce CoMM, a high-quality dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content.
CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling.
Various quality evaluation metrics are designed to prove the high quality of the filtered dataset.
arXiv Detail & Related papers (2024-06-15T01:27:58Z) - Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z) - MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation [15.343028838291078]
We propose MMIDR, a framework designed to teach LLMs in providing fluent and high-quality textual explanations for their decision-making process of multimodal misinformation.
To convert multimodal misinformation into an appropriate instruction-following format, we present a data augmentation perspective and pipeline.
Furthermore, we design an efficient knowledge distillation approach to distill the capability of proprietary LLMs in explaining multimodal misinformation into open-source LLMs.
arXiv Detail & Related papers (2024-03-21T06:47:28Z) - An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models [116.50367506746713]
We present an empirical study of scaling LLaVA up to 33B and 65B/70B.
We find that scaling LMM consistently enhances model performance and improves language capabilities.
We hope that this study makes state-of-the-art LMM research at a larger scale more accessible.
arXiv Detail & Related papers (2023-09-18T17:30:46Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z) - Multimodal Entity Tagging with Multimodal Knowledge Base [45.84732232595781]
We propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB)
In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair.
We conduct extensive experiments and make analyses on the experimental results.
arXiv Detail & Related papers (2021-12-21T15:04:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.