See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models
- URL: http://arxiv.org/abs/2510.16769v1
- Date: Sun, 19 Oct 2025 09:20:44 GMT
- Title: See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models
- Authors: Shuo Han, Yukun Cao, Zezhong Ding, Zengyi Gao, S. Kevin Zhou, Xike Xie
- Abstract summary: We propose a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality.
- Score: 34.29171455515379
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality, using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.
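The routing and retrieval ideas described in the abstract can be illustrated with a small sketch. This is not GraphVista's implementation: the task names, the `route`, `k_hop_subgraph`, and `build_context` helpers, and the use of a plain k-hop neighborhood as the "task-relevant subgraph" are all assumptions made for illustration. The abstract does not specify the task taxonomy, the retrieval procedure, or how subgraphs are rendered for the visual modality.

```python
from collections import deque

# Hypothetical task categories; the abstract only says that simple property
# reasoning goes to text and local, structurally complex reasoning to vision.
TEXT_TASKS = {"node_count", "edge_count", "degree"}
VISUAL_TASKS = {"shortest_path", "cycle_check", "connectivity"}


def route(task):
    """Return the modality a (hypothetical) planner would pick for a task."""
    if task in TEXT_TASKS:
        return "text"
    if task in VISUAL_TASKS:
        return "visual"
    raise ValueError(f"unknown task: {task}")


def k_hop_subgraph(adj, seeds, k):
    """Keep only nodes within k hops of the seed nodes (breadth-first search).

    A stand-in for retrieving a task-relevant subgraph; the paper's actual
    retrieval over its hierarchical GraphRAG base may differ.
    """
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k hops
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    keep = set(dist)
    # Induced subgraph on the kept nodes, with sorted neighbor lists.
    return {u: sorted(v for v in adj.get(u, ()) if v in keep) for u in sorted(keep)}


def build_context(adj, task, seeds, k=1):
    """Return (modality, context) for an undirected graph given as adjacency lists."""
    modality = route(task)
    if modality == "text":
        n_nodes = len(adj)
        n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
        return modality, f"nodes={n_nodes}, edges={n_edges}"
    # In a real system this subgraph would be rendered as an image for the VLM.
    return modality, k_hop_subgraph(adj, seeds, k)


# Path graph 0-1-2-3-4 as a usage example.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(build_context(path, "degree", [2]))
print(build_context(path, "shortest_path", [2], k=1))
```

The point of the sketch is only the division of labor: a compact textual summary suffices for global property questions, while structural questions receive a small local subgraph rather than the full graph, which keeps the context within input-token limits.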
Related papers
- GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning [50.40400074353263]
Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs. We introduce Graph In-context Learning Transformer (GILT), a framework built on an LLM-free and tuning-free architecture.
arXiv Detail & Related papers (2025-10-06T08:09:15Z)
- Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models [10.813015912529936]
We introduce the first comprehensive benchmark designed to evaluate and enhance the multi-graph reasoning abilities of Vision-Language Models (VLMs). Our benchmark covers four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous graph groupings. We evaluate several state-of-the-art VLMs under a multi-dimensional scoring framework that assesses graph parsing, reasoning consistency, and instruction-following accuracy.
arXiv Detail & Related papers (2025-03-27T12:20:37Z)
- Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models [3.9489815622117566]
Learnable Graph Pooling Token (LGPT) enables flexible and efficient graph representation. Our method achieves a 4.13% performance improvement on the GraphQA benchmark without training the large language model.
arXiv Detail & Related papers (2025-01-29T10:35:41Z)
- Graph-Based Multimodal Contrastive Learning for Chart Question Answering [11.828192162922436]
This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures. The framework integrates both visual and textual graphs to capture structural and semantic characteristics. A graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts.
arXiv Detail & Related papers (2025-01-08T06:27:07Z)
- A Hierarchical Language Model For Interpretable Graph Reasoning [47.460255447561906]
We introduce Hierarchical Language Model for Graph (HLM-G), which employs a two-block architecture to capture node-centric local information and interaction-centric global structure.
The proposed scheme allows LLMs to address various graph queries with high efficacy, efficiency, and robustness, while reducing computational costs on large-scale graph tasks.
Comprehensive evaluations across diverse graph reasoning and real-world tasks of node, link, and graph-levels highlight the superiority of our method.
arXiv Detail & Related papers (2024-10-29T00:28:02Z)
- From Anchors to Answers: A Novel Node Tokenizer for Integrating Graph Structure into Large Language Models [27.353083085394008]
We present NT-LLM, a novel framework with an anchor-based positional encoding scheme for graph representation. Our approach strategically selects reference nodes as anchors and encodes each node's position relative to these anchors, capturing essential topological information without the computational burden of existing methods. By implementing a rank-preserving objective for positional encoding pretraining, NT-LLM achieves superior performance across diverse graph tasks ranging from basic structural analysis to complex reasoning scenarios.
arXiv Detail & Related papers (2024-10-14T17:21:57Z)
- Bridging Local Details and Global Context in Text-Attributed Graphs [62.522550655068336]
GraphBridge is a framework that bridges local and global perspectives by leveraging contextual textual information.
Our method achieves state-of-the-art performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
arXiv Detail & Related papers (2024-06-18T13:35:25Z)
- GRAG: Graph Retrieval-Augmented Generation [14.98084919101233]
Graph Retrieval-Augmented Generation (GRAG) tackles the fundamental challenges in retrieving textual subgraphs. We propose a novel divide-and-conquer strategy that retrieves the optimal subgraph structure in linear time. Our experiments on graph reasoning benchmarks demonstrate that our GRAG approach significantly outperforms current state-of-the-art RAG methods.
arXiv Detail & Related papers (2024-05-26T10:11:40Z)
- When Graph Data Meets Multimodal: A New Paradigm for Graph Understanding and Reasoning [54.84870836443311]
The paper presents a new paradigm for understanding and reasoning about graph data by integrating image encoding and multimodal technologies.
This approach enables the comprehension of graph data through an instruction-response format, utilizing GPT-4V's advanced capabilities.
The study evaluates this paradigm on various graph types, highlighting the model's strengths and weaknesses, particularly in Chinese OCR performance and complex reasoning tasks.
arXiv Detail & Related papers (2023-12-16T08:14:11Z)
- GraphGPT: Graph Instruction Tuning for Large Language Models [27.036935149004726]
Graph Neural Networks (GNNs) have evolved to understand graph structures.
To enhance robustness, self-supervised learning (SSL) has become a vital tool for data augmentation.
Our research tackles this by advancing graph model generalization in zero-shot learning environments.
arXiv Detail & Related papers (2023-10-19T06:17:46Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- CommPOOL: An Interpretable Graph Pooling Framework for Hierarchical Graph Representation Learning [74.90535111881358]
We propose a new interpretable graph pooling framework, CommPOOL.
It can capture and preserve the hierarchical community structure of graphs in the graph representation learning process.
CommPOOL is a general and flexible framework for hierarchical graph representation learning.
arXiv Detail & Related papers (2020-12-10T21:14:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.