MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
- URL: http://arxiv.org/abs/2311.10774v2
- Date: Mon, 15 Apr 2024 15:48:48 GMT
- Title: MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
- Authors: Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, Dong Yu
- Abstract summary: A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
- Score: 48.63002688222462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has been impressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chart image understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal Chart Instruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we develop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose MultiModal Chart Benchmark (MMC-Benchmark), a human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts. Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs in correctly interpreting charts, even for the most recent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding of charts. Code and data are available at https://github.com/FuxiaoLiu/MMC.
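For illustration only, the sketch below shows one plausible way a chart instruction-tuning instance could be structured: a chart image paired with a natural-language instruction and a target response. The field names and example values are assumptions made for this sketch, not the released MMC-Instruction schema (see the repository above for the actual format).

```python
# Hypothetical layout of a chart instruction-tuning record; field names and
# values are illustrative assumptions, not the released MMC-Instruction schema.
from dataclasses import dataclass

@dataclass
class ChartInstructionInstance:
    image_path: str   # path to the rendered chart image
    instruction: str  # user-oriented question or task description
    response: str     # target answer used as the supervision signal
    chart_type: str   # e.g. "bar", "line", "pie"
    task: str         # e.g. "chart QA", "chart summarization"

example = ChartInstructionInstance(
    image_path="charts/line_0001.png",
    instruction="Between 2018 and 2021, in which year does the revenue curve peak?",
    response="Revenue peaks in 2020.",
    chart_type="line",
    task="chart QA",
)
```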
Related papers
- MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems [18.188725200923333]
Existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios.
We introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning.
Our results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field.
arXiv Detail & Related papers (2024-10-18T05:15:50Z)
- ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding [34.750274110196116]
We propose ChartMoE, which employs the mixture of expert (MoE) architecture to replace the traditional linear projector.
Specifically, we train multiple linear connectors through distinct alignment tasks.
We also introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks.
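As a minimal sketch of the mixture-of-experts connector idea summarized above (not the ChartMoE implementation; module names and dimensions are assumptions), each visual token can be routed through several linear projectors whose outputs are combined by a learned gate:

```python
# Minimal sketch of an MoE connector: several linear projectors ("experts")
# combined per visual token by a softmax gate. Illustrates the general idea
# only; it is not the ChartMoE implementation.
import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert plays the role of the usual single linear projector.
        self.experts = nn.ModuleList(
            [nn.Linear(vision_dim, llm_dim) for _ in range(num_experts)]
        )
        # The gate scores each visual token and weights the experts' outputs.
        self.gate = nn.Linear(vision_dim, num_experts)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, vision_dim)
        weights = torch.softmax(self.gate(vision_tokens), dim=-1)  # (B, N, E)
        expert_out = torch.stack(
            [expert(vision_tokens) for expert in self.experts], dim=-2
        )                                                          # (B, N, E, D)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)   # (B, N, D)

projector = MoEConnector(vision_dim=1024, llm_dim=4096)
tokens = torch.randn(2, 576, 1024)   # e.g. ViT patch features (dimensions assumed)
llm_inputs = projector(tokens)       # (2, 576, 4096)
```

In this reading, each expert could be initialized from a connector trained on a different alignment task (e.g. chart-to-table or chart-to-code), which is what the distinct alignment tasks described above would provide.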
arXiv Detail & Related papers (2024-09-05T06:41:02Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation [42.945960365307485]
We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs).
ChartMimic includes 1,000 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains.
Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs' capacity to harmonize a blend of cognitive capabilities.
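As a hedged illustration of how a (figure, instruction, code) triplet might be checked (an assumption for this sketch, not the ChartMimic harness), one could execute the model's generated plotting code and render the result for comparison against the reference figure:

```python
# Hypothetical check for chart-to-code generation: run the generated matplotlib
# code and see whether it executes and produces a figure. Not the ChartMimic
# evaluation protocol; the code snippet and paths are illustrative assumptions.
import matplotlib
matplotlib.use("Agg")            # render off-screen, no display needed
import matplotlib.pyplot as plt

generated_code = """
import matplotlib.pyplot as plt
plt.plot([2018, 2019, 2020, 2021], [1.2, 1.8, 2.5, 2.1])
plt.title("Revenue (M$)")
"""

def render(code: str, out_path: str) -> bool:
    """Execute generated plotting code and save the current figure."""
    try:
        exec(code, {"__name__": "__main__"})  # untrusted code: sandbox in practice
        plt.savefig(out_path)
        return True
    except Exception:
        return False
    finally:
        plt.close("all")

print(render(generated_code, "candidate.png"))  # compare candidate.png to the reference chart
```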
arXiv Detail & Related papers (2024-06-14T12:10:51Z)
- TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.58521787193293]
We present TinyChart, an efficient MLLM for chart understanding with only 3B parameters.
TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, and (2) it shortens the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module.
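A minimal sketch of the token-merging idea mentioned above, assuming a simple greedy scheme that averages the most similar neighbouring tokens (TinyChart's actual Vision Token Merging module may differ):

```python
# Greedy, similarity-based merging of neighbouring visual tokens to shorten the
# sequence. A simplified illustration only, not TinyChart's module.
import torch
import torch.nn.functional as F

def merge_adjacent_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """tokens: (num_tokens, dim); merges up to r neighbouring pairs."""
    # Cosine similarity between each token and its right-hand neighbour.
    sim = F.cosine_similarity(tokens[:-1], tokens[1:], dim=-1)  # (num_tokens - 1,)
    merged = tokens.clone()
    keep = torch.ones(tokens.size(0), dtype=torch.bool)
    # Merge the r most similar neighbouring pairs, skipping already-merged tokens.
    for idx in sim.topk(r).indices.tolist():
        if keep[idx] and keep[idx + 1]:
            merged[idx] = (merged[idx] + merged[idx + 1]) / 2
            keep[idx + 1] = False
    return merged[keep]

tokens = torch.randn(576, 1024)               # high-resolution patch features (size assumed)
shorter = merge_adjacent_tokens(tokens, r=288)
print(shorter.shape)                          # at most 288 tokens removed
```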
arXiv Detail & Related papers (2024-04-25T14:23:24Z)
- ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning [54.82612435284695]
We benchmark the ability of off-the-shelf Multi-modal Large Language Models (MLLMs) in the chart domain.
We construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data.
We develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns.
arXiv Detail & Related papers (2024-02-19T14:48:23Z)
- ChartBench: A Benchmark for Complex Visual Reasoning in Charts [36.492851648081405]
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation.
Current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics.
We propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning.
arXiv Detail & Related papers (2023-12-26T07:20:55Z)
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation [70.1393163657813]
We create a high-quality instruction-tuning dataset leveraging GPT-4.
Next, we introduce ChartLlama, a multi-modal large language model trained on this dataset.
arXiv Detail & Related papers (2023-11-27T15:20:23Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)