Declarative Techniques for NL Queries over Heterogeneous Data
- URL: http://arxiv.org/abs/2510.16470v1
- Date: Sat, 18 Oct 2025 12:27:59 GMT
- Title: Declarative Techniques for NL Queries over Heterogeneous Data
- Authors: Elham Khabiri, Jeffrey O. Kephart, Fenno F. Heath III, Srideepika Jayaraman, Fateh A. Tipu, Yingjie Li, Dhruv Shah, Achille Fokoue, Anu Bhamidipaty
- Abstract summary: In many industrial settings, users wish to ask questions in natural language, the answers to which require assembling information from diverse structured data sources. With the advent of Large Language Models (LLMs), applications can now translate natural language questions into a set of API calls or database calls, execute them, and combine the results into an appropriate natural language response. However, these applications remain impractical in realistic industrial settings because they do not cope with the data source heterogeneity that typifies such environments.
- Score: 15.249556281397608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many industrial settings, users wish to ask questions in natural language, the answers to which require assembling information from diverse structured data sources. With the advent of Large Language Models (LLMs), applications can now translate natural language questions into a set of API calls or database calls, execute them, and combine the results into an appropriate natural language response. However, these applications remain impractical in realistic industrial settings because they do not cope with the data source heterogeneity that typifies such environments. In this work, we simulate the heterogeneity of real industry settings by introducing two extensions of the popular Spider benchmark dataset that require a combination of database and API calls. Then, we introduce a declarative approach to handling such data heterogeneity and demonstrate that it copes with data source heterogeneity significantly better than state-of-the-art LLM-based agentic or imperative code generation systems. Our augmented benchmarks are available to the research community.
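The abstract describes a pipeline in which a natural language question is translated into a set of database and API calls, each call is executed against its source, and the results are combined into a response. A minimal sketch of that flow is below; all names here (the plan structure, `query_db`, `call_api`, the stub weather service) are hypothetical illustrations for this summary, not the paper's actual interfaces or its declarative formalism.

```python
# Sketch: route a question's sub-requests to a database or an API,
# execute each against its heterogeneous source, and merge the results.
import sqlite3

def query_db(conn, sql):
    """Execute a SQL call against a structured data source."""
    return conn.execute(sql).fetchall()

def call_api(endpoint, params):
    """Stand-in for an external API call; a real system would issue an
    HTTP request here. The weather service below is a fake for illustration."""
    fake_services = {"weather": lambda city: {"city": city, "temp_c": 21}}
    return fake_services[endpoint](**params)

def answer(plan, conn):
    """Declarative-style dispatch: each sub-task is described as data
    (source kind + request), then executed and its result collected."""
    results = []
    for step in plan:
        if step["source"] == "db":
            results.append(query_db(conn, step["request"]))
        else:
            endpoint, params = step["request"]
            results.append(call_api(endpoint, params))
    return results

# In the real system, a planner (e.g., an LLM) would derive this plan
# from the user's natural language question; here it is hard-coded.
plan = [
    {"source": "db", "request": "SELECT name FROM cities LIMIT 1"},
    {"source": "api", "request": ("weather", {"city": "Paris"})},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT)")
conn.execute("INSERT INTO cities VALUES ('Paris')")
out = answer(plan, conn)
```

The point of the declarative framing, as the abstract contrasts it with agentic or imperative code generation, is that the plan is data to be interpreted rather than generated code to be executed directly.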
Related papers
- Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data. FieldLVLM consists of two main components: a field-aware language generation strategy and data-compressed multimodal model tuning. Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z)
- Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things [0.0]
The Internet of Battlefield Things (IoBT) gives rise to new opportunities for enhancing situational awareness. To increase the potential of IoBT for situational awareness in critical decision making, the data from these devices must be processed into consumer-ready information objects. We propose a workflow that uses natural language processing (NLP) to query a database technology and return a response in natural language.
arXiv Detail & Related papers (2025-06-05T20:52:13Z)
- RAISE: Reasoning Agent for Interactive SQL Exploration [47.77323087050061]
We propose a novel framework that unifies schema linking, query generation, and iterative refinement within a single, end-to-end component. Our method emulates how humans answer questions when working with unfamiliar databases.
arXiv Detail & Related papers (2025-06-02T03:07:08Z)
- RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls. Existing approaches to synthetic data generation fail to replicate real-world data distributions. We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv Detail & Related papers (2025-05-15T16:53:45Z)
- Needle: A Generative AI-Powered Multi-modal Database for Answering Complex Natural Language Queries [8.779871128906787]
Multi-modal datasets often lack the detailed descriptions that properly capture the rich information encoded in each item. This makes answering complex natural language queries a major challenge in this domain. We introduce a generative Monte Carlo method that utilizes foundation models to generate synthetic samples. Our system is open-source and ready for deployment, designed to be easily adopted by researchers and developers.
arXiv Detail & Related papers (2024-12-01T01:36:41Z)
- A System and Benchmark for LLM-based Q&A on Heterogeneous Data [17.73258512415368]
We introduce the siwarex platform, which enables seamless natural language access to both databases and APIs.
Our modified Spider benchmark will soon be available to the research community.
arXiv Detail & Related papers (2024-09-09T15:44:39Z)
- Text2SQL is Not Enough: Unifying AI and Databases with TAG [47.45480855418987]
Table-Augmented Generation (TAG) is a paradigm for answering natural language questions over databases.
We develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly.
arXiv Detail & Related papers (2024-08-27T00:50:14Z)
- UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics.
We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z)
- Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and can be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.