FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding
- URL: http://arxiv.org/abs/2407.05183v2
- Date: Tue, 9 Jul 2024 21:16:00 GMT
- Title: FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding
- Authors: Huitong Pan, Qi Zhang, Cornelia Caragea, Eduard Dragut, Longin Jan Latecki
- Abstract summary: The FlowLearn dataset is a resource tailored to enhance the understanding of flowcharts.
The scientific subset contains 3,858 flowcharts sourced from scientific literature.
The simulated subset contains 10,000 flowcharts created using a customizable script.
- Score: 52.35520385083425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Flowcharts are graphical tools that represent complex concepts in a concise visual form. This paper introduces the FlowLearn dataset, a resource tailored to enhance the understanding of flowcharts. FlowLearn contains complex scientific flowcharts and simulated flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature, and the simulated subset contains 10,000 flowcharts created using a customizable script. The dataset is enriched with annotations for visual components, OCR, Mermaid code representations, and VQA question-answer pairs. Despite the proven capabilities of Large Vision-Language Models (LVLMs) in various visual understanding tasks, their effectiveness in decoding flowcharts - a crucial element of scientific communication - has yet to be thoroughly investigated. The FlowLearn test set is crafted to assess the performance of LVLMs in flowchart comprehension. Our study evaluates state-of-the-art LVLMs, identifying existing limitations and establishing a foundation for future enhancements in this relatively underexplored domain. For instance, in tasks involving simulated flowcharts, GPT-4V achieved the highest accuracy (58%) in counting the number of nodes, while Claude recorded the highest accuracy (83%) in OCR tasks. Notably, no single model excels in all tasks within the FlowLearn framework, highlighting significant opportunities for further development.
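To make the annotation format concrete, here is a minimal sketch of what a simulated-subset record might look like. The field names (`mermaid`, `nodes`, `vqa`) are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a FlowLearn-style simulated-flowchart record.
# Field names are illustrative assumptions, not the dataset's actual schema.

record = {
    "image": "sim_00042.png",
    # Mermaid code representation of the rendered flowchart.
    "mermaid": (
        "flowchart TD\n"
        "    A[Start] --> B{Converged?}\n"
        "    B -- yes --> C[Report result]\n"
        "    B -- no --> D[Update parameters]\n"
        "    D --> B\n"
    ),
    # OCR annotations: the text carried by each visual component.
    "nodes": ["Start", "Converged?", "Report result", "Update parameters"],
    # One VQA pair of the kind used to probe LVLMs.
    "vqa": {"question": "How many nodes does the flowchart contain?",
            "answer": "4"},
}

# A node-counting probe of the kind where GPT-4V scored 58%:
predicted = "4"  # stand-in for an LVLM's answer
print("correct" if predicted == record["vqa"]["answer"] else "wrong")
```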
Related papers
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
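As a rough illustration of the code-as-intermediary idea (a sketch under assumptions, not ReachQA's actual synthesis pipeline): an LLM emits chart-plotting code, the chart is rendered from explicit data, and Q&A pairs are derived from that same data, so answers are correct by construction.

```python
# Sketch of the code-as-intermediary idea: the chart and its Q&A pairs are
# both derived from the same explicit data, so answers are grounded.
# An illustration only, not ReachQA's actual synthesis pipeline.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

labels = ["2019", "2020", "2021", "2022"]
values = [12, 18, 9, 25]

fig, ax = plt.subplots()
ax.bar(labels, values)
ax.set_xlabel("Year")
ax.set_ylabel("Papers published")
fig.savefig("chart.png")

# Q&A pairs synthesized directly from the underlying data.
qa_pairs = [
    ("Which year has the highest bar?", labels[values.index(max(values))]),
    ("What is the total across all years?", str(sum(values))),
]
print(qa_pairs)
```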
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.
We also present WorFEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.
We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
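For intuition, here is a minimal sketch of subsequence matching between a predicted action sequence and a reference one, an illustration of the general technique rather than WorFEval's actual protocol.

```python
# Minimal longest-common-subsequence scorer for linear workflows.
# An illustration of subsequence matching, not WorFEval's actual protocol.
def lcs_len(pred, ref):
    # dp[i][j] = LCS length of pred[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if p == r else max(dp[i-1][j], dp[i][j-1])
    return dp[len(pred)][len(ref)]

pred = ["search", "filter", "summarize", "report"]
ref = ["search", "summarize", "verify", "report"]
score = lcs_len(pred, ref) / len(ref)  # recall-style match ratio
print(f"subsequence match: {score:.2f}")  # 0.75
```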
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
- How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension [53.6373473053431]
This work introduces a benchmark to assess large language models' capabilities in graph pattern tasks.
We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions.
Our benchmark encompasses both synthetic and real datasets, covering a total of 11 tasks and 7 models.
arXiv Detail & Related papers (2024-10-04T04:48:33Z)
- Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
Multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA).
Recent efforts primarily focus on scaling up training datasets through data collection and synthesis.
We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models [0.34952465649465553]
We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions for tasks related to flowcharts.
We find that even the GPT-4o model achieves a score of only 56.63.
Among open-source models, Phi-3-Vision obtained the highest score of 49.97.
arXiv Detail & Related papers (2024-06-14T14:15:35Z)
- Parameter-Efficient Tuning Large Language Models for Graph Representation Learning [62.26278815157628]
We introduce Graph-aware Parameter-Efficient Fine-Tuning (GPEFT), a novel approach for efficient graph representation learning.
We use a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt.
We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations.
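A stripped-down sketch of the graph-prompt idea follows: neighbor embeddings are aggregated and projected into prompt vectors prepended to the LLM's token embeddings. The shapes and the mean-pooling aggregator are assumptions for illustration; this is not GPEFT's actual architecture.

```python
# Stripped-down sketch of the graph-prompt idea: aggregate neighbor
# embeddings and project them into prompt vectors prepended to the LLM's
# token embeddings. Shapes and the mean-pooling "GNN" stand-in are
# assumptions; this is not GPEFT's actual architecture.
import torch
import torch.nn as nn

class GraphPrompt(nn.Module):
    def __init__(self, node_dim, hidden, n_prompt):
        super().__init__()
        self.hidden = hidden
        self.n_prompt = n_prompt
        self.proj = nn.Linear(node_dim, n_prompt * hidden)

    def forward(self, neighbor_feats):          # (num_neighbors, node_dim)
        pooled = neighbor_feats.mean(dim=0)     # crude GNN stand-in
        return self.proj(pooled).view(self.n_prompt, self.hidden)

node_dim, hidden = 32, 64                       # assumed dimensions
prompt_net = GraphPrompt(node_dim, hidden, n_prompt=4)
neighbors = torch.randn(5, node_dim)            # 5 neighboring nodes
token_embs = torch.randn(10, hidden)            # embeddings of 10 text tokens

# Prepend the graph prompt; only prompt_net's parameters would be trained.
llm_input = torch.cat([prompt_net(neighbors), token_embs], dim=0)
print(llm_input.shape)  # torch.Size([14, 64])
```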
arXiv Detail & Related papers (2024-04-28T18:36:59Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is the use of LLM-generated explanations as features to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
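A toy sketch of the explanations-as-features idea: encode explanation text into vectors and concatenate them with the original node features before the GNN. TF-IDF stands in for the paper's LM interpreter, and the dimensions are assumptions.

```python
# Toy sketch of explanations-as-features: encode LLM-generated explanations
# into vectors and concatenate them with the original node features before
# feeding a GNN. TF-IDF stands in for the paper's LM interpreter.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

explanations = [
    "This paper studies graph neural networks, so it belongs to cs.LG.",
    "The abstract focuses on databases and query optimization (cs.DB).",
]
node_feats = np.random.rand(2, 8)  # original node features (assumed dim 8)

expl_feats = TfidfVectorizer().fit_transform(explanations).toarray()
enriched = np.concatenate([node_feats, expl_feats], axis=1)
print(enriched.shape)  # (2, 8 + vocabulary size)
```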
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- A Unified Active Learning Framework for Annotating Graph Data with Application to Software Source Code Performance Prediction [4.572330678291241]
We develop a unified active learning framework specializing in software performance prediction.
We investigate the impact of using different levels of information for active and passive learning.
Our approach aims to make the investment in AI models for software performance prediction more cost-effective.
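As a generic illustration of an active-learning loop (not this paper's specific framework): iteratively label the sample the current model is least certain about, then retrain.

```python
# Generic uncertainty-sampling active-learning loop; an illustration of the
# technique, not this paper's specific framework.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 5))                        # e.g., code/config features
y = X @ rng.random(5) + 0.1 * rng.random(200)   # e.g., measured runtimes

labeled = list(range(10))                       # start with 10 labeled samples
pool = [i for i in range(200) if i not in labeled]

for _ in range(20):  # label 20 more points, one per round
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[labeled], y[labeled])
    # Uncertainty = variance of per-tree predictions over the pool.
    preds = np.stack([t.predict(X[pool]) for t in model.estimators_])
    pick = pool[int(preds.var(axis=0).argmax())]
    labeled.append(pick)                        # "annotate" the pick
    pool.remove(pick)

print(f"labeled {len(labeled)} samples")
```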
arXiv Detail & Related papers (2023-04-06T14:00:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.