Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
- URL: http://arxiv.org/abs/2412.13540v2
- Date: Tue, 18 Feb 2025 04:00:30 GMT
- Title: Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
- Authors: Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Jun Yu, Min Zhang,
- Abstract summary: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks.
Recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs.
We propose VGCure, a benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs.
- Score: 33.85555387495046
- License:
- Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs' performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs.
Related papers
- Are Large Language Models In-Context Graph Learners? [31.172657860606297]
Large language models (LLMs) have remarkable in-context reasoning capabilities across a wide range of tasks.
However, they struggle to handle structured data, such as graphs, due to their lack of understanding of non-Euclidean structures.
We show that learning on graph data can be conceptualized as a retrieval-augmented generation (RAG) process.
We propose a series of RAG frameworks to enhance the in-context learning capabilities of LLMs for graph learning tasks.
arXiv Detail & Related papers (2025-02-19T09:14:19Z) - Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks.
They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison.
We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - Revisiting the Graph Reasoning Ability of Large Language Models: Case Studies in Translation, Connectivity and Shortest Path [53.71787069694794]
We focus on the graph reasoning ability of Large Language Models (LLMs)
We revisit the ability of LLMs on three fundamental graph tasks: graph description translation, graph connectivity, and the shortest-path problem.
Our findings suggest that LLMs can fail to understand graph structures through text descriptions and exhibit varying performance for all these fundamental tasks.
arXiv Detail & Related papers (2024-08-18T16:26:39Z) - In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
compositional image understanding remains a rather difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z) - Learning on Graphs with Large Language Models(LLMs): A Deep Dive into Model Robustness [39.57155321515097]
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing tasks.
It remains unclear whether LLMs exhibit robustness in learning on graphs.
arXiv Detail & Related papers (2024-07-16T09:05:31Z) - Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs [11.19928977117624]
Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts.
Various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts.
These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts.
This paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks.
arXiv Detail & Related papers (2024-06-01T01:43:30Z) - VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs)
Ours is a novel approach to extend the utility of LLMs in the context of video tasks.
We harness their contextual learning capabilities to generate executable visual programs for video understanding.
arXiv Detail & Related papers (2024-03-21T18:00:00Z) - Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs)
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.
MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z) - Disentangled Representation Learning with Large Language Models for
Text-Attributed Graphs [57.052160123387104]
We present the Disentangled Graph-Text Learner (DGTL) model, which is able to enhance the reasoning and predicting capabilities of LLMs for TAGs.
Our proposed DGTL model incorporates graph structure information through tailored disentangled graph neural network (GNN) layers.
Experimental evaluations demonstrate the effectiveness of the proposed DGTL model on achieving superior or comparable performance over state-of-the-art baselines.
arXiv Detail & Related papers (2023-10-27T14:00:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.