FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL
- URL: http://arxiv.org/abs/2512.12084v1
- Date: Fri, 12 Dec 2025 23:25:00 GMT
- Title: FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL
- Authors: Hanzhou Liu, Kai Yin, Zhitong Chen, Chenyue Liu, Ali Mostafavi
- Abstract summary: FLOODSQL-BENCH is a benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers.
- Score: 4.973502845481286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Text-to-SQL benchmarks primarily focus on single-table queries or limited joins in general-purpose domains, and thus fail to reflect the complexity of domain-specific, multi-table, and geospatial reasoning. To address this limitation, we introduce FLOODSQL-BENCH, a geospatially grounded benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers. We systematically evaluate recent large language models under the same retrieval-augmented generation settings and measure their performance across difficulty tiers. By providing a unified, open benchmark grounded in real-world disaster management data, FLOODSQL-BENCH establishes a practical testbed for advancing Text-to-SQL research in high-stakes application domains.
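The three join types named in the abstract can be illustrated with a minimal, self-contained sketch. All table names, columns, and values below are invented for illustration and are not from the benchmark; the spatial predicate is approximated with a bounding-box test in SQLite, whereas a PostGIS deployment would typically use functions such as ST_Contains or ST_DWithin.

```python
import sqlite3

# Hypothetical miniature schema illustrating key-based, spatial, and hybrid
# joins over flood-related data layers (all names/values invented).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE census_tracts (tract_id TEXT PRIMARY KEY, population INTEGER,
                            lat REAL, lon REAL);
CREATE TABLE flood_zones   (zone_id TEXT PRIMARY KEY, severity TEXT,
                            min_lat REAL, max_lat REAL,
                            min_lon REAL, max_lon REAL);
CREATE TABLE shelters      (shelter_id TEXT PRIMARY KEY, tract_id TEXT,
                            capacity INTEGER);
INSERT INTO census_tracts VALUES ('T1', 5000, 29.75, -95.36),
                                 ('T2', 3000, 30.10, -95.80);
INSERT INTO flood_zones   VALUES ('Z1', 'high', 29.70, 29.80, -95.40, -95.30);
INSERT INTO shelters      VALUES ('S1', 'T1', 200);
""")

# Key-based join: shelters linked to tracts by a shared identifier.
key_join = cur.execute("""
    SELECT s.shelter_id, t.population
    FROM shelters s JOIN census_tracts t ON s.tract_id = t.tract_id
""").fetchall()

# Spatial join (bounding-box containment stands in for a true
# geometric predicate like ST_Contains).
spatial_join = cur.execute("""
    SELECT t.tract_id, z.severity
    FROM census_tracts t JOIN flood_zones z
      ON t.lat BETWEEN z.min_lat AND z.max_lat
     AND t.lon BETWEEN z.min_lon AND z.max_lon
""").fetchall()

# Hybrid join: key-based and spatial predicates combined in one query.
hybrid_join = cur.execute("""
    SELECT s.shelter_id, z.severity
    FROM shelters s
    JOIN census_tracts t ON s.tract_id = t.tract_id
    JOIN flood_zones z
      ON t.lat BETWEEN z.min_lat AND z.max_lat
     AND t.lon BETWEEN z.min_lon AND z.max_lon
""").fetchall()

print(key_join)      # [('S1', 5000)]
print(spatial_join)  # [('T1', 'high')]
print(hybrid_join)   # [('S1', 'high')]
```

Queries that chain key-based and spatial conditions, as in the hybrid case, are the kind of multi-table geospatial reasoning the benchmark targets.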
Related papers
- SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints [9.733987594033907]
SpotIt+ is a tool for evaluating text-to-SQL systems via bounded equivalence verification. We introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases.
arXiv Detail & Related papers (2026-03-04T17:51:42Z) - APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL [39.76924093980244]
APEX-SQL is a framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data.
arXiv Detail & Related papers (2026-02-11T07:50:47Z) - Routing End User Queries to Enterprise Databases [13.367384894681651]
We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries.
arXiv Detail & Related papers (2026-01-27T17:30:19Z) - Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL [8.159121916366727]
Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as available external knowledge. This mismatch substantially limits the real-world applicability of state-of-the-art Text-to-SQL systems. We propose a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence.
arXiv Detail & Related papers (2025-12-17T07:11:55Z) - Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z) - GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries [12.523407991161315]
We present GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS-based NL2GeoSQL query generation. We also release a public GeoSQL-Eval leaderboard platform for continuous testing and global comparison.
arXiv Detail & Related papers (2025-09-28T04:50:48Z) - Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business activity across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z) - Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries [36.92547259037192]
The proliferation of unstructured data poses a fundamental challenge to traditional database infrastructure. While Text-to-SQL has democratized access to structured data, it remains incapable of interpreting semantic or multi-modal queries. We introduce and formalize Text2VectorSQL, a novel task to establish a unified natural language interface for seamlessly querying both structured and unstructured data.
arXiv Detail & Related papers (2025-06-29T03:17:42Z) - LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO [0.6374763930914525]
We introduce a novel F1-score-based 'soft' metric that quantifies the informational overlap between generated and ground-truth results. We demonstrate our contributions through an empirical evaluation on an authentic MRO database.
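As a rough illustration of such a soft metric (not the paper's exact formulation, which is not given here), one can score the multiset overlap between the result cells of the generated query and those of the ground-truth query with precision, recall, and F1:

```python
from collections import Counter

def soft_f1(predicted_cells, gold_cells):
    """F1 over the multiset intersection of query-result cells.

    A sketch of an F1-style 'soft' execution metric: partial credit for
    partially overlapping results, instead of all-or-nothing exact match.
    """
    pred, gold = Counter(predicted_cells), Counter(gold_cells)
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

print(soft_f1(["A", "B", "C"], ["A", "B", "D"]))  # prints 0.6666666666666666
```

Two results sharing two of three cells score 2/3 rather than 0, which is the kind of graded signal a soft metric provides over exact-match execution accuracy.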
arXiv Detail & Related papers (2025-06-11T04:04:13Z) - RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z) - WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos [48.88393315169039]
Collaborative learning (CL) techniques enable multiple parties to train models jointly without sharing raw data. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases. We build a large-scale dataset constructed from 100,000 real-world relational databases linked by 17 million weighted edges.
arXiv Detail & Related papers (2025-05-22T13:07:06Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - CHESS: Contextual Harnessing for Efficient SQL Synthesis [1.9506402593665235]
We introduce CHESS, a framework for efficient and scalable text-to-SQL synthesis.
It comprises four specialized agents, each targeting one of the aforementioned challenges.
Our framework offers features that adapt to various deployment constraints.
arXiv Detail & Related papers (2024-05-27T01:54:16Z) - STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL evaluation.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and applies schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question.
BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks.
Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.