Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
- URL: http://arxiv.org/abs/2603.02865v1
- Date: Tue, 03 Mar 2026 11:17:31 GMT
- Title: Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
- Authors: Haruto Yoshida, Keito Kudo, Yoichi Aoki, Ryota Tanaka, Itsumi Saito, Keisuke Sakaguchi, Kentaro Inui
- Abstract summary: We probe the internal representations of large vision-language models (LVLMs) using a synthetic diagram dataset based on directed graphs. Our experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens of the language model. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information.
- Score: 32.05060138278358
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (LVLMs) demonstrate strong performance on diagram understanding benchmarks, yet they still struggle with understanding relationships between elements, particularly those represented by nodes and directed edges (e.g., arrows and lines). To investigate the underlying causes of this limitation, we probe the internal representations of LVLMs using a carefully constructed synthetic diagram dataset based on directed graphs. Our probing experiments reveal that edge information is not linearly separable in the vision encoder and becomes linearly encoded only in the text tokens of the language model. In contrast, node information and global structural features are already linearly encoded in individual hidden states of the vision encoder. These findings suggest that the stage at which linearly separable representations are formed varies depending on the type of visual information. In particular, the delayed emergence of edge representations may help explain why LVLMs struggle with relational understanding, such as interpreting edge directions, which requires more abstract, compositionally integrated processing.
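The probing recipe itself is standard: fit a linear classifier on frozen hidden states from a given layer and treat its held-out accuracy as a measure of linear separability. Below is a minimal sketch of such a probe in Python; the function name, array shapes, and the edge-presence labeling are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical inputs: hidden_states has shape (num_diagrams, hidden_dim),
# one vector per diagram taken from a fixed layer of the vision encoder or
# language model; labels encodes a property to probe, e.g. whether an edge
# A -> B is present (1) or absent (0) in the diagram.
def linear_probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000)  # linear decision boundary only
    probe.fit(X_train, y_train)
    # High held-out accuracy suggests the property is linearly decodable at
    # this layer; chance-level accuracy suggests it is not.
    return accuracy_score(y_test, probe.predict(X_test))
```

Running such a probe once per layer, and once per property (node identity, edge presence, edge direction, global structure), traces where in the model each kind of information becomes linearly decodable.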
Related papers
- Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation [6.004292247258359]
We show how to densely align images with synthetic descriptions generated by generative vision-language models. Our approach outperforms prior work on standard zero-shot open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2025-09-15T12:26:47Z)
- From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context [2.66757978610454]
LOGIC is a lightweight, post-hoc framework that uses large language models to generate faithful and interpretable explanations for GNN predictions. Our experiments demonstrate that LOGIC achieves a favorable trade-off between fidelity and sparsity, while significantly improving human-centric metrics such as insightfulness.
arXiv Detail & Related papers (2025-08-09T23:22:38Z)
- Revisit What You See: Disclose Language Prior in Vision Tokens for LVLM Decoding [6.612630497074871]
Large Vision-Language Models (LVLMs) achieve strong performance across multimodal tasks by integrating visual perception with language understanding. We propose ReVisiT, a training-free decoding method that references vision tokens to guide text generation.
arXiv Detail & Related papers (2025-06-11T08:46:55Z)
- Can Visual Encoder Learn to See Arrows? [6.561578916344682]
We investigate whether an image encoder can learn edge representations through training on a diagram dataset. To this end, we conduct contrastive learning on an artificially generated diagram-caption dataset to train an image encoder. Our results show that the fine-tuned model outperforms pretrained CLIP in all tasks and surpasses zero-shot GPT-4o and LLaVA-Mistral in the captioning task.
arXiv Detail & Related papers (2025-05-26T13:09:31Z)
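The training objective above is described only as contrastive learning on diagram-caption pairs; the usual instantiation is a CLIP-style symmetric InfoNCE loss over a batch of paired embeddings. A minimal sketch under that assumption follows; the function name, shapes, and temperature value are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Minimal CLIP-style contrastive (InfoNCE) loss over a batch of paired
# diagram and caption embeddings, each of shape (batch, dim).
def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Matched diagram-caption pairs sit on the diagonal; the loss is
    # symmetric over the image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

Minimizing this loss pulls each diagram embedding toward its own caption and away from the other captions in the batch, which is what would pressure the encoder to represent edge information whenever the captions describe edges.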
- Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [79.75818239774952]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. We propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z)
- Multi-View Empowered Structural Graph Wordification for Language Models [12.22063024099311]
We introduce an end-to-end modality-aligning framework for LLM-graph alignment: the Dual-Residual Vector Quantized-Variational AutoEncoder, namely Dr.E. Our approach is purposefully designed to facilitate token-level alignment with LLMs, enabling an effective translation of the intrinsic 'language' of graphs into comprehensible natural language. Our framework ensures a degree of visual interpretability, efficiency, and robustness, marking a promising endeavor to achieve token-level alignment between LLMs and GNNs.
arXiv Detail & Related papers (2024-06-19T16:43:56Z)
- Visually Descriptive Language Model for Vector Graphics Reasoning [76.42082386029206]
We propose the Visually Descriptive Language Model (VDLM) to bridge the gap between low-level visual perception and high-level language reasoning. We show that VDLM significantly improves state-of-the-art LMMs like GPT-4o on various multimodal perception and reasoning tasks.
arXiv Detail & Related papers (2024-04-09T17:30:18Z)
- Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs [59.74814230246034]
Large Language Models (LLMs) have been proven to possess extensive common knowledge and powerful semantic comprehension abilities.
We investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors.
arXiv Detail & Related papers (2023-07-07T05:31:31Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Learning the Implicit Semantic Representation on Graph-Structured Data [57.670106959061634]
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole.
We propose a Semantic Graph Convolutional Network (SGCN) that explores the implicit semantics by learning latent semantic paths in graphs.
arXiv Detail & Related papers (2021-01-16T16:18:43Z)