LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
- URL: http://arxiv.org/abs/2505.19510v2
- Date: Thu, 29 May 2025 05:23:38 GMT
- Title: LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
- Authors: Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo,
- Abstract summary: Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks.<n>Recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene.<n>We introduce Text-Scene Graph (TSG) Bench, a benchmark designed to assess LLMs' ability to understand scene graphs.
- Score: 12.90392791734461
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
Related papers
- LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models [54.82915844507371]
Text-Attributed Graphs (TAGs) are ubiquitous in real-world scenarios.<n>Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures.<n>We propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning.
arXiv Detail & Related papers (2025-03-05T09:45:22Z) - Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation [26.19182768810174]
Graph-structured data has become increasingly prevalent across various domains, raising the demand for effective models to handle graph tasks.<n>Traditional graph learning models like Graph Neural Networks (GNNs) have made significant strides, but their capabilities in handling graph data remain limited in certain contexts.<n>In recent years, large language models (LLMs) have emerged as promising candidates for graph tasks, yet most studies focus primarily on performance benchmarks.
arXiv Detail & Related papers (2025-02-26T03:03:46Z) - Are Large Language Models In-Context Graph Learners? [31.172657860606297]
Large language models (LLMs) have remarkable in-context reasoning capabilities across a wide range of tasks.<n>However, they struggle to handle structured data, such as graphs, due to their lack of understanding of non-Euclidean structures.<n>We show that learning on graph data can be conceptualized as a retrieval-augmented generation (RAG) process.<n>We propose a series of RAG frameworks to enhance the in-context learning capabilities of LLMs for graph learning tasks.
arXiv Detail & Related papers (2025-02-19T09:14:19Z) - Generative Visual Commonsense Answering and Explaining with Generative Scene Graph Constructing [46.701439459096235]
We propose a novel visual commonsense reasoning generation method named textittextbfG2.<n>It first utilizes the image patches and LLMs to construct a location-free scene graph, and then answer and explain based on the scene graph's information.<n>We also propose automatic scene graph filtering and selection strategies to absorb valuable scene graph information during training.
arXiv Detail & Related papers (2025-01-15T04:00:36Z) - SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation [8.768484848591168]
SceneLLM is a framework that transforms video frames into linguistic signals (scene tokens)<n>Our method achieves state-of-the-art results on the Action Genome (AG) benchmark.<n>Extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
arXiv Detail & Related papers (2024-12-15T02:41:31Z) - What Do LLMs Need to Understand Graphs: A Survey of Parametric Representation of Graphs [69.48708136448694]
Large language models (LLMs) are reorganizing in the AI community for their expected reasoning and inference abilities.<n>We believe this kind of parametric representation of graphs, graph laws, can be a solution for making LLMs understand graph data as the input.
arXiv Detail & Related papers (2024-10-16T00:01:31Z) - How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension [53.6373473053431]
This work introduces a benchmark to assess large language models' capabilities in graph pattern tasks.<n>We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions.<n>Our benchmark encompasses both synthetic and real datasets, and a variety of models, with a total of 11 tasks and 7 models.
arXiv Detail & Related papers (2024-10-04T04:48:33Z) - Parameter-Efficient Tuning Large Language Models for Graph Representation Learning [62.26278815157628]
We introduce Graph-aware.
Efficient Fine-Tuning - GPEFT, a novel approach for efficient graph representation learning.
We use a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt.
We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations.
arXiv Detail & Related papers (2024-04-28T18:36:59Z) - From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z) - Large Language Models on Graphs: A Comprehensive Survey [77.16803297418201]
We provide a systematic review of scenarios and techniques related to large language models on graphs.
We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-attributed graphs, and text-paired graphs.
We discuss the real-world applications of such methods and summarize open-source codes and benchmark datasets.
arXiv Detail & Related papers (2023-12-05T14:14:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.