Related papers: Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

URL: http://arxiv.org/abs/2510.02394v1
Date: Wed, 01 Oct 2025 04:01:17 GMT
Title: Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing
Authors: Manasi Patwardhan, Ayush Agarwal, Shabbirhussain Bhaisaheb, Aseem Arora, Lovekesh Vig, Sunita Sarawagi
Abstract summary: We propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match.
Score: 28.56221748194599
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain-specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc, query-specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual domain statements, and (2) our sub-string match based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
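
The abstract does not spell out how the sub-string level match works, so the following is only a minimal sketch of one plausible reading: a domain statement is retrieved when its phrase appears as a contiguous run of words in the user query. The `DomainStatement` structure, the `retrieve` function, and the sample statements are all hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainStatement:
    # Hypothetical structure: a domain phrase and the schema-level
    # expression it stands for. Not the paper's actual format.
    phrase: str
    expression: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_span(haystack: list[str], needle: list[str]) -> bool:
    """Return True if `needle` occurs as a contiguous sub-list of `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def retrieve(query: str, statements: list[DomainStatement]) -> list[DomainStatement]:
    """Keep the statements whose phrase appears word-for-word in the query."""
    query_tokens = tokenize(query)
    return [s for s in statements if contains_span(query_tokens, tokenize(s.phrase))]


# Illustrative, made-up domain statements for a single database.
STATEMENTS = [
    DomainStatement("churn rate", "COUNT(cancelled_at) * 1.0 / COUNT(*)"),
    DomainStatement("premium customers", "customers.tier = 'premium'"),
    DomainStatement("net revenue", "SUM(amount) - SUM(refund)"),
]

hits = retrieve("What is the churn rate for premium customers?", STATEMENTS)
print([s.phrase for s in hits])
```

Exact word-level matching is the simplest choice here; it misses inflected forms such as "revenues", which a real system would presumably need to handle in some way.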

Related papers

Routing End User Queries to Enterprise Databases [13.367384894681651]
We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries.
arXiv Detail & Related papers (2026-01-27T17:30:19Z)
ORANGE: An Online Reflection ANd GEneration framework with Domain Knowledge for Text-to-SQL [8.241433772695018]
Large Language Models (LLMs) have demonstrated remarkable progress in translating natural language to SQL. A significant semantic gap persists between their general knowledge and domain-specific semantics of databases. We introduce ORANGE, an online self-evolutionary framework that constructs database-specific knowledge bases by parsing queries from translation logs.
arXiv Detail & Related papers (2025-11-02T15:57:18Z)
SQL-Exchange: Transforming SQL Queries Across Domains [5.5643498845134545]
We introduce a framework for mapping queries across different database schemas by preserving the source query structure while adapting domain-specific elements to align with the target schema. We investigate the conditions under which such mappings are feasible and beneficial, and examine their impact on enhancing the in-context learning performance of text-to-SQL systems.
arXiv Detail & Related papers (2025-08-09T19:55:54Z)
RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL [1.3654846342364308]
We introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units. Our solution enables practical text-to-SQL interfaces across diverse enterprise settings without specialized fine-tuning.
arXiv Detail & Related papers (2025-07-30T21:09:47Z)
Knowledge Base Construction for Knowledge-Augmented Text-to-SQL [37.87911346522774]
We propose constructing a knowledge base for text-to-SQL, a foundational source of knowledge, from which we generate necessary knowledge for given queries. Our knowledge base is comprehensive: it is constructed from a combination of all available questions and their associated database schemas. We validate our approach on multiple text-to-SQL datasets, considering both overlapping and non-overlapping database scenarios.
arXiv Detail & Related papers (2025-05-28T08:17:58Z)
Datrics Text2SQL: A Framework for Natural Language to SQL Query Generation [0.0]
This paper introduces a Retrieval-Augmented Generation (RAG)-based framework designed to generate accurate SQL queries by leveraging structured documentation, example-based learning, and domain-specific rules. The paper details the architecture, training methodology, and retrieval logic, highlighting how the system bridges the gap between user intent and database structure without requiring SQL expertise.
arXiv Detail & Related papers (2025-04-03T21:09:59Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems. It is composed of publicly available text-to-SQL datasets and 29K databases. Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge [54.85168428642474]
We build a new Chinese benchmark, KnowSQL, consisting of domain-specific questions covering various domains. We then address this problem by presenting formulaic knowledge, rather than by annotating additional data examples. More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing.
arXiv Detail & Related papers (2023-01-03T12:37:47Z)
Uni-Parser: Unified Semantic Parser for Question Answering on Knowledge Base and Database [86.03294330305097]
We propose a unified semantic parser for question answering (QA) on both knowledge bases (KB) and databases (DB). We introduce the primitive (relation and entity in KB; table name, column name, and cell value in DB) as an essential element in our framework. We leverage the generator to predict final logical forms by altering and composing top-ranked primitives with different operations.
arXiv Detail & Related papers (2022-11-09T19:33:27Z)
KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers [26.15889661083109]
We present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases. We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers but that a more realistic evaluation setting and creative use of associated database documentation boosts their accuracy by over 13.2%.
arXiv Detail & Related papers (2021-06-22T00:08:03Z)
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question. BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks. Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)
DART: Open-Domain Structured Data Record to Text Generation [91.23798751437835]
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). We propose a procedure of extracting semantic triples from tables that encode their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks.
arXiv Detail & Related papers (2020-07-06T16:35:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.