CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
- URL: http://arxiv.org/abs/2305.15891v1
- Date: Thu, 25 May 2023 09:44:44 GMT
- Title: CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset
- Authors: Hanchong Zhang, Jieyu Li, Lu Chen, Ruisheng Cao, Yunyan Zhang, Yu
Huang, Yefeng Zheng, Kai Yu
- Abstract summary: We present CSS, a large-scale CrosS-Schema Chinese text-to-SQL dataset, to carry on corresponding studies.
CSS originally consisted of 4,340 question/SQL pairs across 2 databases.
In order to generalize models to different medical systems, we create 19 new databases along with 29,280 corresponding examples.
- Score: 40.43236560948185
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The cross-domain text-to-SQL task aims to build a system that can parse user
questions into SQL on completely unseen databases, and the single-domain
text-to-SQL task evaluates the performance on identical databases. Both of
these setups confront unavoidable difficulties in real-world applications. To
this end, we introduce the cross-schema text-to-SQL task, where the databases
of evaluation data are different from those in the training data but come from
the same domain. Furthermore, we present CSS, a large-scale CrosS-Schema
Chinese text-to-SQL dataset, to carry on corresponding studies. CSS originally
consisted of 4,340 question/SQL pairs across 2 databases. In order to
generalize models to different medical systems, we extend CSS and create 19 new
databases along with 29,280 corresponding dataset examples. Moreover, CSS is
also a large corpus for single-domain Chinese text-to-SQL studies. We present
the data collection approach and a series of analyses of the data statistics.
To show the potential and usefulness of CSS, benchmarking baselines have been
conducted and reported. Our dataset is publicly available at
https://huggingface.co/datasets/zhanghanchong/css.
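The cross-schema setup described in the abstract can be illustrated with a minimal sketch: evaluation databases are held out from training, yet all databases share one domain. The field names (`db_id`, `question`, `sql`), the function name `cross_schema_split`, and the toy data are assumptions made for illustration; they are not taken from the CSS release and do not reproduce the authors' actual split.

```python
import random
from collections import defaultdict


def cross_schema_split(examples, eval_fraction=0.25, seed=0):
    """Hold out whole databases so evaluation schemas are unseen in training.

    Each example is a dict with hypothetical keys "db_id", "question", "sql".
    All databases are assumed to come from the same domain (e.g. medical).
    """
    # Group examples by the database they were written against.
    by_db = defaultdict(list)
    for ex in examples:
        by_db[ex["db_id"]].append(ex)

    # Shuffle database ids deterministically, then hold out a fraction of them.
    db_ids = sorted(by_db)
    random.Random(seed).shuffle(db_ids)
    n_eval = max(1, round(len(db_ids) * eval_fraction))

    evaluation = [ex for db in db_ids[:n_eval] for ex in by_db[db]]
    train = [ex for db in db_ids[n_eval:] for ex in by_db[db]]
    return train, evaluation


# Toy data: 4 same-domain databases with 3 question/SQL pairs each.
toy = [
    {"db_id": f"hospital_{i}", "question": f"q{j}", "sql": "SELECT 1"}
    for i in range(4)
    for j in range(3)
]
train, evaluation = cross_schema_split(toy)
train_dbs = {ex["db_id"] for ex in train}
eval_dbs = {ex["db_id"] for ex in evaluation}
```

By contrast, a single-domain split would divide examples at random while keeping every database in both parts, and a cross-domain split would additionally require the held-out databases to come from different domains.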
Related papers
- SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas [2.905751301655124]
A key bottleneck for developing text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity.
We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile.
SQaLe captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity.
arXiv Detail & Related papers (2025-12-16T09:15:10Z) - PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation [21.0303026118673]
We introduce PARROT, a practical and realistic benchmark for CrOss-System SQL Translation.
PARROT comprises 598 translation pairs from 38 open-source benchmarks and real-world business services.
We also provide multiple benchmark variants, including PARROT-Diverse with 28,003 translations and PARROT-Simple with 5,306 representative samples.
arXiv Detail & Related papers (2025-09-27T14:41:13Z) - A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? [32.84561352339466]
We provide a review of Text-to-SQL translation techniques powered by Large Language Models (LLMs).
We discuss the research challenges and open problems of Text-to-SQL evaluation in the LLMs era.
arXiv Detail & Related papers (2024-08-09T14:59:36Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL Evaluation systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Can LLM Already Serve as A Database Interface? A BIg Bench for
Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present Bird, a big benchmark for large-scale database grounded text-to-SQL tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-SQL models, e.g. ChatGPT, achieve only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z) - Prompting GPT-3.5 for Text-to-SQL with De-semanticization and Skeleton
Retrieval [17.747079214502673]
Text-to-SQL is a task that converts a natural language question into a structured query language (SQL) query to retrieve information from a database.
In this paper, we propose an LLM-based framework for Text-to-SQL which retrieves helpful demonstration examples to prompt LLMs.
We design a de-semanticization mechanism that extracts question skeletons, allowing us to retrieve similar examples based on their structural similarity.
arXiv Detail & Related papers (2023-04-26T06:02:01Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - STAR: SQL Guided Pre-Training for Context-dependent Text-to-SQL Parsing [64.80483736666123]
We propose a novel pre-training framework STAR for context-dependent text-to-SQL parsing.
In addition, we construct a large-scale context-dependent text-to-SQL conversation corpus to pre-train STAR.
Extensive experiments show that STAR achieves new state-of-the-art performance on two downstream benchmarks.
arXiv Detail & Related papers (2022-10-21T11:30:07Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) query based on the evidence provided by databases.
Deep neural networks have significantly advanced this task by neural generation models, which automatically learn a mapping function from an input NL question to an output query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset [39.78074639729293]
CHASE contains 2,003 sessions manually constructed from scratch (CHASE-C) and 3,456 sessions translated from English (CHASE-T).
We find the two parts are highly discrepant and incompatible as training and evaluation data.
In this work, we present SeSQL, yet another large-scale session-level text-to-SQL parsing dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch.
arXiv Detail & Related papers (2022-08-26T15:11:10Z) - Data Augmentation with Hierarchical SQL-to-Question Generation for
Cross-domain Text-to-SQL Parsing [40.65143087243074]
This paper presents a simple yet effective data augmentation framework.
First, given a database, we automatically produce a large number of SQL queries based on an abstract syntax tree grammar.
Second, we propose a hierarchical SQL-to-question generation model to obtain high-quality natural language questions.
arXiv Detail & Related papers (2021-03-03T07:37:38Z)