FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL
- URL: http://arxiv.org/abs/2512.12084v1
- Date: Fri, 12 Dec 2025 23:25:00 GMT
- Title: FloodSQL-Bench: A Retrieval-Augmented Benchmark for Geospatially-Grounded Text-to-SQL
- Authors: Hanzhou Liu, Kai Yin, Zhitong Chen, Chenyue Liu, Ali Mostafavi
- Abstract summary: FLOODSQL-BENCH is a benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers.
- Score: 4.973502845481286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Text-to-SQL benchmarks primarily focus on single-table queries or limited joins in general-purpose domains, and thus fail to reflect the complexity of domain-specific, multi-table, and geospatial reasoning. To address this limitation, we introduce FLOODSQL-BENCH, a geospatially grounded benchmark for the flood management domain that integrates heterogeneous datasets through key-based, spatial, and hybrid joins. The benchmark captures realistic flood-related information needs by combining social, infrastructural, and hazard data layers. We systematically evaluate recent large language models under the same retrieval-augmented generation settings and measure their performance across difficulty tiers. By providing a unified, open benchmark grounded in real-world disaster management data, FLOODSQL-BENCH establishes a practical testbed for advancing Text-to-SQL research in high-stakes application domains.
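The three join types named in the abstract can be illustrated with a minimal, self-contained sketch. All table names, columns, and values below are invented for illustration and are not from the benchmark; the spatial predicate is approximated with a bounding-box test in SQLite, whereas a PostGIS deployment would typically use functions such as ST_Contains or ST_DWithin.

```python
import sqlite3

# Hypothetical miniature schema illustrating key-based, spatial, and hybrid
# joins over flood-related data layers (all names/values invented).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE census_tracts (tract_id TEXT PRIMARY KEY, population INTEGER,
                            lat REAL, lon REAL);
CREATE TABLE flood_zones   (zone_id TEXT PRIMARY KEY, severity TEXT,
                            min_lat REAL, max_lat REAL,
                            min_lon REAL, max_lon REAL);
CREATE TABLE shelters      (shelter_id TEXT PRIMARY KEY, tract_id TEXT,
                            capacity INTEGER);
INSERT INTO census_tracts VALUES ('T1', 5000, 29.75, -95.36),
                                 ('T2', 3000, 30.10, -95.80);
INSERT INTO flood_zones   VALUES ('Z1', 'high', 29.70, 29.80, -95.40, -95.30);
INSERT INTO shelters      VALUES ('S1', 'T1', 200);
""")

# Key-based join: shelters linked to tracts by a shared identifier.
key_join = cur.execute("""
    SELECT s.shelter_id, t.population
    FROM shelters s JOIN census_tracts t ON s.tract_id = t.tract_id
""").fetchall()

# Spatial join (bounding-box containment stands in for a true
# geometric predicate like ST_Contains).
spatial_join = cur.execute("""
    SELECT t.tract_id, z.severity
    FROM census_tracts t JOIN flood_zones z
      ON t.lat BETWEEN z.min_lat AND z.max_lat
     AND t.lon BETWEEN z.min_lon AND z.max_lon
""").fetchall()

# Hybrid join: key-based and spatial predicates combined in one query.
hybrid_join = cur.execute("""
    SELECT s.shelter_id, z.severity
    FROM shelters s
    JOIN census_tracts t ON s.tract_id = t.tract_id
    JOIN flood_zones z
      ON t.lat BETWEEN z.min_lat AND z.max_lat
     AND t.lon BETWEEN z.min_lon AND z.max_lon
""").fetchall()

print(key_join)      # [('S1', 5000)]
print(spatial_join)  # [('T1', 'high')]
print(hybrid_join)   # [('S1', 'high')]
```

Queries that chain key-based and spatial conditions, as in the hybrid case, are the kind of multi-table geospatial reasoning the benchmark targets.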
Related papers
- SpotIt+: Verification-based Text-to-SQL Evaluation with Database Constraints [9.733987594033907]
SpotIt+ is a tool for evaluating text-to-SQL systems via bounded equivalence verification. We introduce a constraint-mining pipeline that combines rule-based specification mining over example databases with LLM-based validation. Experimental results on the BIRD dataset show that the mined constraints enable SpotIt+ to generate more realistic differentiating databases.
arXiv Detail & Related papers (2026-03-04T17:51:42Z) - APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL [39.76924093980244]
APEX-SQL is a framework that shifts the paradigm from passive translation to agentic exploration. Our framework employs a hypothesis-verification loop to ground model reasoning in real data.
arXiv Detail & Related papers (2026-02-11T07:50:47Z) - Routing End User Queries to Enterprise Databases [13.367384894681651]
We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries.
arXiv Detail & Related papers (2026-01-27T17:30:19Z) - Companion Agents: A Table-Information Mining Paradigm for Text-to-SQL [8.159121916366727]
Large-scale Text-to-SQL benchmarks such as BIRD typically assume complete and accurate database annotations as well as available external knowledge. This mismatch substantially limits the real-world applicability of state-of-the-art Text-to-SQL systems. We propose a database-centric approach that leverages intrinsic, fine-grained information residing in relational databases to construct missing evidence.
arXiv Detail & Related papers (2025-12-17T07:11:55Z) - Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z) - GeoSQL-Eval: First Evaluation of LLMs on PostGIS-Based NL2GeoSQL Queries [12.523407991161315]
We present GeoSQL-Eval, the first end-to-end automated evaluation framework for PostGIS-based NL2GeoSQL query generation. We also release a public GeoSQL-Eval leaderboard platform for continuous testing and global comparison.
arXiv Detail & Related papers (2025-09-28T04:50:48Z) - Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business activity across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z) - Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries [36.92547259037192]
The proliferation of unstructured data poses a fundamental challenge to traditional database infrastructure. While Text-to-SQL has democratized access to structured data, it remains incapable of interpreting semantic or multi-modal queries. We introduce and formalize Text2VectorSQL, a novel task to establish a unified natural language interface for seamlessly querying both structured and unstructured data.
arXiv Detail & Related papers (2025-06-29T03:17:42Z) - LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO [0.6374763930914525]
We introduce a novel F1-score-based 'soft' metric that quantifies the informational overlap between generated and ground-truth results. We demonstrate our contributions through an empirical evaluation on an authentic MRO database.
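As a rough illustration of such a soft metric (not the paper's exact formulation, which is not given here), one can score the multiset overlap between the result cells of the generated query and those of the ground-truth query with precision, recall, and F1:

```python
from collections import Counter

def soft_f1(predicted_cells, gold_cells):
    """F1 over the multiset intersection of query-result cells.

    A sketch of an F1-style 'soft' execution metric: partial credit for
    partially overlapping results, instead of all-or-nothing exact match.
    """
    pred, gold = Counter(predicted_cells), Counter(gold_cells)
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

print(soft_f1(["A", "B", "C"], ["A", "B", "D"]))  # prints 0.6666666666666666
```

Two results sharing two of three cells score 2/3 rather than 0, which is the kind of graded signal a soft metric provides over exact-match execution accuracy.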
arXiv Detail & Related papers (2025-06-11T04:04:13Z) - RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z) - WikiDBGraph: A Data Management Benchmark Suite for Collaborative Learning over Database Silos [48.88393315169039]
Collaborative learning (CL) techniques enable multiple parties to train models jointly without sharing raw data. Current CL benchmarks and algorithms primarily target the learning step under assumptions of isolated, aligned, and joinable databases. We build a large-scale dataset constructed from 100,000 real-world relational databases linked by 17 million weighted edges.
arXiv Detail & Related papers (2025-05-22T13:07:06Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - CHESS: Contextual Harnessing for Efficient SQL Synthesis [1.9506402593665235]
We introduce CHESS, a framework for efficient and scalable text-to-SQL synthesis.
It comprises four specialized agents, each targeting one of the aforementioned challenges.
Our framework offers features that adapt to various deployment constraints.
arXiv Detail & Related papers (2024-05-27T01:54:16Z) - STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases [93.96463520716759]
We develop STARK, a large-scale semi-structured retrieval benchmark on Textual and Relational Knowledge Bases.
Our benchmark covers three domains: product search, academic paper search, and queries in precision medicine.
We design a novel pipeline to synthesize realistic user queries that integrate diverse relational information and complex textual properties.
arXiv Detail & Related papers (2024-04-19T22:54:54Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL evaluation.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and applies schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question.
BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks.
Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.