SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset
- URL: http://arxiv.org/abs/2208.12711v1
- Date: Fri, 26 Aug 2022 15:11:10 GMT
- Title: SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset
- Authors: Saihao Huang, Lijie Wang, Zhenghua Li, Zeyang Liu, Chenhui Dou, Fukang
Yan, Xinyan Xiao, Hua Wu, Min Zhang
- Abstract summary: CHASE contains 2,003 sessions manually constructed from scratch (CHASE-C) and 3,456 sessions translated from English (CHASE-T).
We find the two parts are highly discrepant and incompatible as training and evaluation data.
In this work, we present SeSQL, yet another large-scale session-level text-to-SQL parsing dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch.
- Score: 39.78074639729293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the first session-level Chinese dataset, CHASE contains two separate
parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and
3,456 sessions translated from English SParC (CHASE-T). We find the two parts
are highly discrepant and incompatible as training and evaluation data. In this
work, we present SeSQL, yet another large-scale session-level text-to-SQL
dataset in Chinese, consisting of 5,028 sessions all manually constructed from
scratch. In order to guarantee data quality, we adopt an iterative annotation
workflow to facilitate intense and in-time review of previous-round natural
language (NL) questions and SQL queries. Moreover, by completing all
context-dependent NL questions, we obtain 27,012 context-independent
question/SQL pairs, allowing SeSQL to be used as the largest dataset for
single-round multi-DB text-to-SQL parsing. We conduct benchmark session-level
text-to-SQL parsing experiments on SeSQL by employing three competitive
session-level parsers, and present detailed analysis.
Related papers
- Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation [0.10499611180329804]
This paper introduces text-2-SQL-4-PM, a benchmark dataset for the text-to-SQL task in the process mining domain.
The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers.
The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks.
arXiv Detail & Related papers (2025-08-18T01:25:41Z)
- OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale [31.852909145101677]
We propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention.
We introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases.
We develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B.
arXiv Detail & Related papers (2025-03-04T03:30:56Z)
- A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? [32.84561352339466]
We provide a review of Text-to-SQL translation techniques powered by Large Language Models (LLMs).
We discuss the research challenges and open problems of Text-to-SQL evaluation in the LLM era.
arXiv Detail & Related papers (2024-08-09T14:59:36Z)
- SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the SQL-PaLM framework for enhancing Text-to-SQL using large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
- UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z)
- CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset [40.43236560948185]
We present a large-scale CrosS-Schema Chinese text-to-SQL dataset to carry on corresponding studies.
CSS originally consisted of 4,340 question/SQL pairs across 2 databases.
In order to generalize models to different medical systems, we create 19 new databases along with 29,280 corresponding examples.
arXiv Detail & Related papers (2023-05-25T09:44:44Z)
- Towards Generalizable and Robust Text-to-SQL Parsing [77.18724939989647]
We propose a novel TKK framework consisting of Task decomposition, Knowledge acquisition, and Knowledge composition to learn text-to-SQL parsing in stages.
We show that our framework is effective in all scenarios and achieves state-of-the-art performance on the Spider, SParC, and CoSQL datasets.
arXiv Detail & Related papers (2022-10-23T09:21:27Z)
- A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) based on the evidence provided by databases.
Deep neural networks have significantly advanced this task by neural generation models, which automatically learn a mapping function from an input NL question to an output query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z)
- Weakly Supervised Text-to-SQL Parsing through Question Decomposition [53.22128541030441]
We take advantage of the recently proposed question meaning representation called QDMR.
Given questions, their QDMR structures (annotated by non-experts or automatically predicted) and the answers, we are able to automatically synthesize SQL queries.
Our results show that the weakly supervised models perform competitively with those trained on NL-SQL benchmark data.
arXiv Detail & Related papers (2021-12-12T20:02:42Z)