Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense
and Hypothetical Reasoning
- URL: http://arxiv.org/abs/2402.12554v2
- Date: Sun, 25 Feb 2024 00:12:38 GMT
- Title: Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense
and Hypothetical Reasoning
- Authors: Danna Zheng, Mirella Lapata, Jeff Z. Pan
- Abstract summary: This dataset demonstrates a significantly higher level of complexity compared to existing publicly available datasets.
Archer challenges the capabilities of current state-of-the-art models, with a high-ranked model on the Spider leaderboard achieving only 6.73% execution accuracy on the Archer test set.
- Score: 67.7258569181669
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Archer, a challenging bilingual text-to-SQL dataset specific to
complex reasoning, including arithmetic, commonsense and hypothetical
reasoning. It contains 1,042 English questions and 1,042 Chinese questions,
along with 521 unique SQL queries, covering 20 English databases across 20
domains. Notably, this dataset demonstrates a significantly higher level of
complexity compared to existing publicly available datasets. Our evaluation
shows that Archer challenges the capabilities of current state-of-the-art
models, with a high-ranked model on the Spider leaderboard achieving only 6.73%
execution accuracy on the Archer test set. Thus, Archer presents a significant
challenge for future research in this field.
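The abstract reports results as execution accuracy. As a rough illustration of that metric, and not the Archer authors' evaluation script, the sketch below runs a predicted and a gold query against the same SQLite database and counts a prediction as correct when both return the same multiset of rows. The `product` table, the query pairs, and all function names are hypothetical and exist only for this example.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries run and yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    # Counter comparison ignores row order but respects duplicates.
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (name TEXT, price REAL, sold INTEGER)")
    conn.executemany(
        "INSERT INTO product VALUES (?, ?, ?)",
        [("pen", 2.0, 10), ("book", 12.5, 4), ("lamp", 30.0, 1)],
    )
    return conn


# Hypothetical (predicted, gold) pairs: two correct, two wrong.
DEMO_PAIRS = [
    # Different formulation, same arithmetic result.
    ("SELECT SUM(price * sold) FROM product",
     "SELECT SUM(sold * price) FROM product"),
    # Wrong threshold, so an extra row is returned.
    ("SELECT name FROM product WHERE price > 10",
     "SELECT name FROM product WHERE price > 20"),
    # Misspelled column, so the query fails to execute.
    ("SELECT nme FROM product",
     "SELECT name FROM product"),
    # Different SQL, same single-row answer.
    ("SELECT name FROM product ORDER BY price DESC LIMIT 1",
     "SELECT name FROM product WHERE price = (SELECT MAX(price) FROM product)"),
]

if __name__ == "__main__":
    print(execution_accuracy(build_demo_db(), DEMO_PAIRS))
```

Comparing rows as a multiset is a simplification: it treats differently ordered results as equal, so questions whose answer depends on row order would need an order-sensitive comparison.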
Related papers
- Monte Carlo Tree Search with Reasoning Path Refinement for Small Language Models in Conversational Text-to-NoSQL [20.156191782890797]
We introduce the Conversational Text-to-NoSQL task, which generates queries given a natural language question, a database, and a dialogue history.
We propose Stage-MCTS, a framework that endows small language models with query-specific reasoning capabilities.
Our approach outperforms state-of-the-art large reasoning models, improving execution value match accuracy by up to 7.93%.
arXiv Detail & Related papers (2026-02-13T03:35:38Z) - Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic [2.8855202197281007]
We introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset.
The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions.
We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo.
arXiv Detail & Related papers (2025-11-16T00:05:40Z) - LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Multi-Domain Reasoning Challenges [13.400649304012179]
The dataset consists of 4,038 English questions, each paired with a unique SQL query and accompanied by 12,114 reasoning annotations, spanning 45 databases across diverse domains.
LogicCat substantially increases the difficulty for state-of-the-art models, with the highest execution accuracy reaching only 14.96%.
Benchmarking leading public methods on Spider and BIRD further underscores the challenges presented by LogicCat, highlighting the significant opportunities for advancing research in robust, reasoning-driven text-to-SQL systems.
arXiv Detail & Related papers (2025-05-24T15:23:43Z) - Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija [5.762345156477737]
This work introduces the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect.
It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains.
The dataset also incorporates the complexities of the Moroccan dialect, which is known for its source languages.
arXiv Detail & Related papers (2025-01-20T14:06:40Z) - INQUIRE: A Natural World Text-to-Image Retrieval Benchmark [51.823709631153946]
We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries.
INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries.
Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals.
arXiv Detail & Related papers (2024-11-04T19:16:53Z) - CodeS: Towards Building Open-source Language Models for Text-to-SQL [42.11113113574589]
We introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B.
CodeS is a fully open language model, which achieves superior accuracy with much smaller parameter sizes.
We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark.
arXiv Detail & Related papers (2024-02-26T07:00:58Z) - Ar-Spider: Text-to-SQL in Arabic [11.463438573648297]
This paper introduces Ar-Spider, the first Arabic cross-domain text-to-SQL dataset.
Due to the unique nature of the language, two major challenges have been encountered, namely linguistic and structural challenges.
We propose the context similarity relationship (CSR) approach, which results in a significant increase in the overall performance of about 1.52% for S2SQL and 1.06% for LGESQL and closes the gap between Arabic and English languages to 7.73%.
arXiv Detail & Related papers (2024-02-22T23:11:17Z) - Text2Analysis: A Benchmark of Table Question Answering with Advanced Data Analysis and Unclear Queries [67.0083902913112]
We develop the Text2Analysis benchmark, incorporating advanced analysis tasks.
We also develop five innovative and effective annotation methods.
We evaluate five state-of-the-art models using three different metrics.
arXiv Detail & Related papers (2023-12-21T08:50:41Z) - CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale and High Quality [42.246771022648765]
We present CATS, a pragmatic Chinese answer-to-sequence dataset with large scale and high quality.
The dataset aims to generate textual descriptions for the answer in the practical TableQA system.
We propose a Unified Graph Transformation approach to establish a joint encoding space for the two hybrid knowledge resources.
arXiv Detail & Related papers (2023-06-20T12:02:26Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness [115.66421993459663]
Recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations.
We propose a comprehensive robustness benchmark based on Spider to diagnose the model.
We conduct a diagnostic study of the state-of-the-art models on the set.
arXiv Detail & Related papers (2023-01-21T03:57:18Z) - Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios [8.553766123004682]
This study frames this task by asking multiple questions with the same set of possible endings as candidate answers.
Our dataset consists of more than 4.5K questions over 1.3K story texts in English.
arXiv Detail & Related papers (2022-09-16T07:38:51Z) - Text-to-SQL in the Wild: A Naturally-Occurring Dataset Based on Stack Exchange Data [3.06261471569622]
SEDE is a dataset with 12,023 pairs of utterances and SQL queries collected from real usage on the Stack Exchange website.
We show that these pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset.
arXiv Detail & Related papers (2021-06-09T12:09:51Z) - TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance [71.76018597965378]
We build a new large-scale Question Answering dataset containing both Tabular And Textual data, named TAT-QA.
We propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text.
arXiv Detail & Related papers (2021-05-17T06:12:06Z)