ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural
Language to SQL Systems
- URL: http://arxiv.org/abs/2306.04743v2
- Date: Tue, 5 Dec 2023 15:05:58 GMT
- Authors: Yi Zhang, Jan Deriu, George Katsogiannis-Meimarakis, Catherine Kosten,
Georgia Koutrika, Kurt Stockinger
- Abstract summary: We introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases.
We show that our benchmark is highly challenging, as the top performing systems on Spider achieve a very low performance on our benchmark.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language to SQL systems (NL-to-SQL) have recently shown a significant
increase in accuracy for natural language to SQL query translation. This
improvement is due to the emergence of transformer-based language models, and
the popularity of the Spider benchmark - the de-facto standard for evaluating
NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85%.
However, Spider mainly contains simple databases with few tables, columns, and
entries, which does not reflect a realistic setting. Moreover, complex
real-world databases with domain-specific content have little to no training
data available in the form of NL/SQL-pairs leading to poor performance of
existing NL-to-SQL systems.
In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL
benchmark for three real-world, highly domain-specific databases. For this new
benchmark, SQL experts and domain experts created high-quality NL/SQL-pairs for
each domain. To garner more data, we extended the small amount of
human-generated data with synthetic data generated using GPT-3. We show that
our benchmark is highly challenging, as the top performing systems on Spider
achieve a very low performance on our benchmark. Thus, the challenge is
many-fold: creating NL-to-SQL systems for highly complex domains with a small
amount of hand-made training data augmented with synthetic data. To our
knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with
complex real-world scientific databases, containing challenging training and
test data carefully validated by domain experts.
Related papers
- CodeS: Towards Building Open-source Language Models for Text-to-SQL [42.11113113574589]
We introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B.
CodeS is a fully open language model, which achieves superior accuracy with much smaller parameter sizes.
We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark.
arXiv Detail & Related papers (2024-02-26T07:00:58Z)
- Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries [4.141402725050671]
This paper is the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice.
It is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022.
All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models.
arXiv Detail & Related papers (2024-02-13T10:28:57Z)
- SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the SQL-PaLM framework for enhancing Text-to-SQL using large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we examine in depth the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present Bird, a big benchmark for large-scale database grounded in text-to-SQL tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-SQL models, i.e. ChatGPT, achieve only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z)
- Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z)
- Weakly Supervised Text-to-SQL Parsing through Question Decomposition [53.22128541030441]
We take advantage of the recently proposed question meaning representation called QDMR.
Given questions, their QDMR structures (annotated by non-experts or automatically predicted) and the answers, we are able to automatically synthesize SQL queries.
Our results show that the weakly supervised models perform competitively with those trained on NL-SQL benchmark data.
arXiv Detail & Related papers (2021-12-12T20:02:42Z)
- "What Do You Mean by That?" A Parser-Independent Interactive Approach for Enhancing Text-to-SQL [49.85635994436742]
We include a human in the loop and present a novel parser-independent interactive approach (PIIA) that interacts with users using multi-choice questions.
PIIA is capable of enhancing text-to-SQL performance with limited interaction turns, as shown by both simulation and human evaluation.
arXiv Detail & Related papers (2020-11-09T02:14:33Z)
- ValueNet: A Natural Language-to-SQL System that Learns from Database Information [4.788755317132195]
Building natural language interfaces for databases has been a long-standing challenge.
Recent focus of research has been on neural networks to tackle this challenge on complex datasets like Spider.
We propose two end-to-end NL-to-SQL systems that incorporate values, using the challenging Spider dataset.
arXiv Detail & Related papers (2020-05-29T15:43:39Z)