Evaluating Cross-Domain Text-to-SQL Models and Benchmarks
- URL: http://arxiv.org/abs/2310.18538v1
- Date: Fri, 27 Oct 2023 23:36:14 GMT
- Title: Evaluating Cross-Domain Text-to-SQL Models and Benchmarks
- Authors: Mohammadreza Pourreza and Davood Rafiei
- Abstract summary: We study text-to- benchmarks and re-evaluate some of the top-performing models within these benchmarks.
We find that attaining a perfect performance on these benchmarks is unfeasible due to the multiple interpretations that can be derived from the provided samples.
A GPT4-based model surpasses the gold standard reference queries in the Spider benchmark in our human evaluation.
- Score: 7.388002745070808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-SQL benchmarks play a crucial role in evaluating the progress made in
the field and the ranking of different models. However, accurately matching a
model-generated SQL query to a reference SQL query in a benchmark fails for
various reasons, such as underspecified natural language queries, inherent
assumptions in both model-generated and reference queries, and the
non-deterministic nature of SQL output under certain conditions. In this paper,
we conduct an extensive study of several prominent cross-domain text-to-SQL
benchmarks and re-evaluate some of the top-performing models within these
benchmarks, by both manually evaluating the SQL queries and rewriting them in
equivalent expressions. Our evaluation reveals that attaining perfect
performance on these benchmarks is infeasible due to the multiple
interpretations that can be derived from the provided samples. Furthermore, we
find that the true performance of the models is underestimated and their
relative performance changes after a re-evaluation. Most notably, our
evaluation reveals a surprising discovery: a recent GPT4-based model surpasses
the gold standard reference queries in the Spider benchmark in our human
evaluation. This finding highlights the importance of interpreting benchmark
evaluations cautiously, while also acknowledging the critical role of
additional independent evaluations in driving advancements in the field.
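To make the matching problem concrete, the sketch below runs a reference query and a hypothetical model prediction against a toy Spider-style table: the two queries are semantically equivalent yet fail exact string matching, and since neither fixes a row order, even execution results must be compared order-insensitively. The schema, data, and queries are illustrative only and are not taken from the benchmarks discussed in the paper.
```python
import sqlite3

# Toy schema in the spirit of Spider-style benchmarks (names are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bob', 45), (3, 'Cal', 45);
""")

# Reference query and a hypothetical model prediction: textually different,
# semantically equivalent, so exact string matching reports a false negative.
gold = "SELECT name FROM singer WHERE age > 40"
pred = "SELECT T1.name FROM singer AS T1 WHERE T1.age >= 41"
print("exact match:", gold == pred)                                     # False

# Execution-based comparison is closer to the truth, but neither query has an
# ORDER BY, so the row order is unspecified and results must be compared as
# multisets rather than as ordered lists.
gold_rows = sorted(conn.execute(gold).fetchall())
pred_rows = sorted(conn.execute(pred).fetchall())
print("execution match (order-insensitive):", gold_rows == pred_rows)   # True
```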
Related papers
- Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-SQL enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposes a novel approach that needs only SQL quality measurement to enhance Text-to-SQL performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z)
- FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark [8.445403382578167]
This paper introduces FLEX (False-Less EXecution), a novel approach to evaluating text-to-SQL systems.
Our metric improves agreement with human experts by using comprehensive context and sophisticated criteria.
This work contributes to a more accurate and nuanced evaluation of text-to-SQL systems, potentially reshaping our understanding of state-of-the-art performance in this field.
arXiv Detail & Related papers (2024-09-24T01:40:50Z)
- Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload [1.2738020945091273]
TPC-DS queries exhibit a significantly higher level of structural complexity compared to the other two benchmarks.
Findings indicate that the current state-of-the-art generative AI models fall short in generating accurate decision-making queries.
Results demonstrate that the accuracy of the generated queries is insufficient for practical real-world application.
arXiv Detail & Related papers (2024-07-28T15:53:05Z)
- Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks [2.1899189033259305]
The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.
This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest.
We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
arXiv Detail & Related papers (2024-04-25T18:35:54Z)
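The entry above finds that per-prompt outcomes are correlated and that accounting for this can change model rankings. A minimal, purely illustrative sketch of why that happens, using a made-up correctness matrix with two correlated clusters of prompts (the data and the cluster-balanced reweighting are assumptions for the example, not the paper's own analysis):
```python
import numpy as np

# Hypothetical correctness vectors for two models on a 100-prompt benchmark
# (1 = correct, 0 = wrong).  Prompts 0-79 share one template / failure mode,
# prompts 80-99 share another, so outcomes are correlated within clusters and
# the test set is not an i.i.d. sample of the task distribution.
cluster = np.array([0] * 80 + [1] * 20)
model_a = np.array([1] * 72 + [0] * 8 + [1] * 2 + [0] * 18)
model_b = np.array([1] * 56 + [0] * 24 + [1] * 16 + [0] * 4)

def unweighted(scores):
    # Standard leaderboard metric: average over all prompts.
    return scores.mean()

def cluster_balanced(scores, cluster):
    # Reweight so each correlated cluster of prompts counts equally.
    return np.mean([scores[cluster == c].mean() for c in np.unique(cluster)])

print("unweighted:       A =", unweighted(model_a), " B =", unweighted(model_b))
print("cluster-balanced: A =", cluster_balanced(model_a, cluster),
      " B =", cluster_balanced(model_b, cluster))
```
With these numbers the unweighted leaderboard prefers model A (0.74 vs 0.72), while the cluster-balanced score prefers model B (0.50 vs 0.75), so the ranking flips once the correlation structure is taken into account.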
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria [43.944632774725484]
We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria.
By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail.
A comparative study showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions.
arXiv Detail & Related papers (2023-09-24T13:19:38Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just several anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
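A rough sketch of the anchor-point idea summarized in the entry above, assuming a clustering of examples by their cross-model confidence profiles is used to pick representatives; the synthetic data and the k-means step below are stand-ins, not the paper's exact selection procedure:
```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic confidence matrix: rows = models, columns = examples.  Confidences
# are correlated across models because examples share a difficulty factor
# (an illustrative construction, not data from the paper).
rng = np.random.default_rng(0)
n_models, n_examples = 8, 500
difficulty = rng.random(n_examples)                # shared per-example factor
skill = rng.normal(0.0, 0.1, size=(n_models, 1))   # per-model offset
conf = np.clip(1.0 - difficulty + skill
               + rng.normal(0, 0.05, (n_models, n_examples)), 0, 1)

# Cluster examples by their confidence profile across models and keep the
# in-cluster example closest to each centroid as an "anchor point".
k = 10
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(conf.T)
anchors = np.array([
    np.argmin(np.linalg.norm(conf.T - c, axis=1) + 1e9 * (km.labels_ != i))
    for i, c in enumerate(km.cluster_centers_)
])
weights = np.bincount(km.labels_, minlength=k) / n_examples

# Estimate each model's mean confidence on the full set from the anchors alone.
full = conf.mean(axis=1)
est = conf[:, anchors] @ weights
print("mean absolute error of the anchor-based estimate:", np.abs(full - est).mean())
```
Because confidences are strongly correlated across models, a handful of weighted anchors approximates each model's full-dataset average with small error.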
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
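A small sketch of the general multi-reference scoring idea behind the entry above: evaluate a candidate against several diversified reference phrasings and keep the best match instead of relying on a single gold reference. The token-overlap F1 below is a simple stand-in for the actual metrics and reference-generation prompts used in the paper:
```python
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1, a simple stand-in for BLEU/ROUGE-style metrics."""
    c, r = tokenize(candidate), tokenize(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def multi_reference_score(candidate: str, references: list[str]) -> float:
    # With diversified references, a candidate only has to match one
    # acceptable phrasing rather than the single gold wording.
    return max(token_f1(candidate, ref) for ref in references)

references = [
    "The company reported a sharp decline in quarterly profit.",   # original gold
    "Quarterly profit fell sharply, the company reported.",        # paraphrase
    "The firm said its profit for the quarter dropped steeply.",   # paraphrase
]
candidate = "The firm said quarterly profit dropped steeply."

print("single reference:  ", round(token_f1(candidate, references[0]), 3))
print("diversified (max): ", round(multi_reference_score(candidate, references), 3))
```
Against the single gold reference the candidate is penalized for its phrasing; once paraphrased references are allowed it scores substantially higher, which is the effect the paper links to better correlation with human evaluation.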
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture the above dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural network based approaches (a method called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems of using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
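A hedged sketch of a ranking-style intrinsic evaluation in the spirit of the entry above: rather than correlating cosine similarities with human similarity scores, check how highly each item's known positive pair ranks among background candidates. The random embeddings and pairing below are placeholders; only the ranking logic is the point.
```python
import numpy as np

def mean_reciprocal_rank(embeddings: np.ndarray, positive: np.ndarray) -> float:
    """Rank each item's positive pair among all other items by cosine similarity
    and return the mean reciprocal rank (higher = better embedding space)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)              # never rank an item against itself
    ranks = []
    for i, pos in enumerate(positive):
        order = np.argsort(-sims[i])             # most similar first
        ranks.append(1.0 / (np.where(order == pos)[0][0] + 1))
    return float(np.mean(ranks))

# Placeholder data: 100 random "sentence embeddings" where items 2k and 2k+1
# are noisy copies of each other, standing in for paraphrase pairs.
rng = np.random.default_rng(1)
base = rng.normal(size=(50, 64))
embeddings = np.repeat(base, 2, axis=0) + 0.1 * rng.normal(size=(100, 64))
positive = np.array([i + 1 if i % 2 == 0 else i - 1 for i in range(100)])

print("MRR of the positive pairs:", round(mean_reciprocal_rank(embeddings, positive), 3))
```
A better embedding space pushes the true pairs toward rank one, so this kind of score can be compared across embedding models in the way the entry above describes.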