Exploring the Capabilities of Large Multimodal Models on Dense Text
- URL: http://arxiv.org/abs/2405.06706v1
- Date: Thu, 9 May 2024 07:47:25 GMT
- Title: Exploring the Capabilities of Large Multimodal Models on Dense Text
- Authors: Shuo Zhang, Biao Yang, Zhang Li, Zhiyin Ma, Yuliang Liu, Xiang Bai
- Abstract summary: We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
- Score: 58.82262549456294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large multi-modal models (LMMs) have shown notable progress in multi-modal tasks, their capabilities in tasks involving dense textual content remain to be fully explored. Dense text, which carries important information, is often found in documents, tables, and product descriptions. Understanding dense text enables us to obtain more accurate information, assisting in making better decisions. To further explore the capabilities of LMMs in complex text tasks, we propose the DT-VQA dataset, with 170k question-answer pairs. In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs on our dataset, revealing their strengths and weaknesses. Furthermore, we evaluate the effectiveness of two strategies for LMMs: prompt engineering and downstream fine-tuning. We find that significant improvements in model performance can be achieved even with automatically labeled training datasets. We hope that this research will promote the study of LMMs in dense text tasks. Code will be released at https://github.com/Yuliang-Liu/MultimodalOCR.
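As a rough illustration of the kind of evaluation the abstract describes, the sketch below loads DT-VQA-style question-answer pairs and scores an LMM with exact-match accuracy. This is a minimal sketch under stated assumptions: the JSONL layout, the field names, and the `answer_question` wrapper are hypothetical, not the paper's actual data format, metric, or code.

```python
# Minimal sketch of a dense-text VQA evaluation loop.
# Assumptions (not from the paper): the test set is a JSONL file whose records
# have "image", "question", and "answer" fields, and `answer_question` is a
# hypothetical callable wrapping whichever LMM is being evaluated.
import json
from typing import Callable, Iterator, Tuple


def load_qa_pairs(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (image_path, question, ground_truth) triples from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["image"], record["question"], record["answer"]


def normalize(text: str) -> str:
    """Lower-case and drop punctuation so the match is not trivially brittle."""
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())


def evaluate(answer_question: Callable[[str, str], str], dataset_path: str) -> float:
    """Exact-match accuracy of the model's answers over all QA pairs."""
    correct, total = 0, 0
    for image_path, question, ground_truth in load_qa_pairs(dataset_path):
        prediction = answer_question(image_path, question)
        correct += int(normalize(prediction) == normalize(ground_truth))
        total += 1
    return correct / max(total, 1)
```

For example, `evaluate(my_lmm_fn, "dt_vqa_test.jsonl")` would return a single accuracy number; a fuzzier metric such as ANLS, common in text-centric VQA benchmarks, could be swapped in for the exact-match comparison.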
Related papers
- LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models [55.903148392998965]
We introduce LOKI, a novel benchmark designed to evaluate the ability of LMMs to detect synthetic data across multiple modalities.
The benchmark includes coarse-grained judgment and multiple-choice questions, as well as fine-grained anomaly selection and explanation tasks.
We evaluate 22 open-source LMMs and 6 closed-source models on LOKI, highlighting their potential as synthetic data detectors and also revealing some limitations in the development of LMM capabilities.
arXiv Detail & Related papers (2024-10-13T05:26:36Z)
- ModalPrompt: Dual-Modality Guided Prompt for Continual Learning of Large Multimodal Models [40.7613157799378]
Large Multimodal Models (LMMs) exhibit remarkable multi-tasking ability by jointly learning from mixed datasets.
Existing methods rely on data replay or model expansion, neither of which is specifically designed for LMMs.
We propose a novel dual-modality guided prompt learning framework (ModalPrompt) tailored for multimodal continual learning.
arXiv Detail & Related papers (2024-10-08T09:35:37Z)
- MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines [91.08394877954322]
Large Multimodal Models (LMMs) have made impressive strides, but whether they can function as AI search engines remains under-explored.
We first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities.
arXiv Detail & Related papers (2024-09-19T17:59:45Z)
- Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents.
Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning.
We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
- An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models [116.50367506746713]
We present an empirical study of scaling LLaVA up to 33B and 65B/70B.
We find that scaling LMMs consistently enhances model performance and improves language capabilities.
We hope that this study makes state-of-the-art LMM research at a larger scale more accessible.
arXiv Detail & Related papers (2023-09-18T17:30:46Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
- Multimodal Entity Tagging with Multimodal Knowledge Base [45.84732232595781]
We propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB).
In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair.
We conduct extensive experiments and analyze the experimental results.
arXiv Detail & Related papers (2021-12-21T15:04:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.