Related papers: Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]

URL: http://arxiv.org/abs/2602.11745v1
Date: Thu, 12 Feb 2026 09:16:44 GMT
Title: Text2GQL-Bench: A Text to Graph Query Language Benchmark [Experiment, Analysis & Benchmark]
Authors: Songlin Lyu, Lujie Ban, Zihang Wu, Tianqi Luo, Jirong Liu, Chenhao Ma, Yuyu Luo, Nan Tang, Shipeng Qi, Heng Lin, Yongchao Liu, Chuntao Hong,
Abstract summary: Text-to-Graph-Query-Language (Text-to-GQL) systems act as a translator, converting natural language into executable graph queries. Existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. We present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations.
Score: 16.678372445240957
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Graph models are fundamental to data analysis in domains rich with complex relationships. Text-to-Graph-Query-Language (Text-to-GQL) systems act as a translator, converting natural language into executable graph queries. This capability allows Large Language Models (LLMs) to directly analyze and manipulate graph data, positioning them as powerful agent infrastructures for Graph Database Management System (GDBMS). Despite recent progress, existing datasets are often limited in domain coverage, supported graph query languages, or evaluation scope. The advancement of Text-to-GQL systems is hindered by the lack of high-quality benchmark datasets and evaluation methods to systematically compare model capabilities across different graph query languages and domains. In this work, we present Text2GQL-Bench, a unified Text-to-GQL benchmark designed to address these limitations. Text2GQL-Bench couples a multi-GQL dataset that has 178,184 (Question, Query) pairs spanning 13 domains, with a scalable construction framework that generates datasets in different domains, question abstraction levels, and GQLs with heterogeneous resources. To support comprehensive assessment, we introduce an evaluation method that goes beyond a single end-to-end metric by jointly reporting grammatical validity, similarity, semantic alignment, and execution accuracy. Our evaluation uncovers a stark dialect gap in ISO-GQL generation: even strong LLMs achieve at most 4% execution accuracy (EX) in zero-shot settings, and although a fixed 3-shot prompt raises accuracy to around 50%, grammatical validity remains lower than 70%. Moreover, a fine-tuned 8B open-weight model reaches 45.1% EX and 90.8% grammatical validity, demonstrating that most of the performance jump is unlocked by exposure to sufficient ISO-GQL examples.

Related papers

SPARQL-LLM: Real-Time SPARQL Query Generation from Natural Language Questions [1.3856736555085554]
SPARQL-LLM is an open-source and triplestore-agnostic approach, powered by lightweight metadata, that generates SPARQL queries from natural language text. We show that SPARQL-LLM is up to 36x faster than other systems participating in the challenge, while costing a maximum of $0.01 per question.
arXiv Detail & Related papers (2025-12-16T10:39:46Z)
Towards Global Retrieval Augmented Generation: A Benchmark for Corpus-Level Reasoning [50.27838512822097]
We introduce GlobalQA, the first benchmark specifically designed to evaluate global RAG capabilities. We propose GlobalRAG, a multi-tool collaborative framework that preserves structural coherence through chunk-level retrieval. On the Qwen2.5-14B model, GlobalRAG achieves 6.63 F1 compared to the strongest baseline's 1.51 F1.
arXiv Detail & Related papers (2025-10-30T07:29:14Z)
GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs [4.005483185111992]
We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals.
arXiv Detail & Related papers (2025-07-10T18:50:05Z)
NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language [13.661054027428868]
We propose NAT-NL2GQL, a novel framework for translating natural language to graph query language. Our framework consists of three synergistic agents: the Preprocessor agent, the Generator agent, and the Refiner agent. Given the scarcity of high-quality open-source NL2GQL datasets based on nGQL syntax, we developed StockGQL, a dataset constructed from a financial market graph database.
arXiv Detail & Related papers (2024-12-11T04:14:09Z)
Towards Evaluating Large Language Models for Graph Query Generation [49.49881799107061]
Large Language Models (LLMs) are revolutionizing the landscape of Generative Artificial Intelligence (GenAI). This paper presents a comparative study addressing the challenge of generating queries a powerful language for interacting with graph databases using open-access LLMs. Our empirical analysis of query generation accuracy reveals that Claude Sonnet 3.5 outperforms its counterparts in this specific domain.
arXiv Detail & Related papers (2024-11-13T09:11:56Z)
Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models. Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models. Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models [70.03602551880526]
We introduce ProGraph, a benchmark for large language models (LLMs) to process graphs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. We propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries.
arXiv Detail & Related papers (2024-09-29T11:38:45Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
Aligning Large Language Models to a Domain-specific Graph Database for NL2GQL [16.637504932927616]
We present a well-defined pipeline for NL2GQL tasks tailored to a particular domain. We employ ChatGPT to generate NL-GQL data pairs, leveraging the provided graph DB with self-instruction. We then employ the generated data to fine-tune LLMs, ensuring alignment between LLMs and the graph DB.
arXiv Detail & Related papers (2024-02-26T13:46:51Z)
$R^3$-NL2GQL: A Model Coordination and Knowledge Graph Alignment Approach for NL2GQL [45.13624736815995]
We introduce a novel approach, $R^3$-NL2GQL, integrating both small and large Foundation Models for ranking, rewriting, and refining tasks. We have developed a bilingual dataset, sourced from graph database manuals and selected open-source Knowledge Graphs (KGs).
arXiv Detail & Related papers (2023-11-03T12:11:12Z)
LMGQS: A Large-scale Dataset for Query-focused Summarization [77.6179359525065]
We convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS. We establish baselines with state-of-the-art summarization models. We achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.
arXiv Detail & Related papers (2023-05-22T14:53:45Z)
ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries [10.273545005890496]
We introduce data augmentation techniques and a sampling-based content-aware BERT model (ColloQL). ColloQL achieves 84.9% (execution) and 90.7% (execution) accuracy on the Wikilogical dataset.
arXiv Detail & Related papers (2020-10-19T23:53:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.