CodeS: Towards Building Open-source Language Models for Text-to-SQL
- URL: http://arxiv.org/abs/2402.16347v1
- Date: Mon, 26 Feb 2024 07:00:58 GMT
- Title: CodeS: Towards Building Open-source Language Models for Text-to-SQL
- Authors: Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu,
Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen
- Abstract summary: We introduce CodeS, a series of pre-trained language models with parameters ranging from 1B to 15B.
CodeS is a fully open language model, which achieves superior accuracy with much smaller parameter sizes.
We conduct comprehensive evaluations on multiple datasets, including the widely used Spider benchmark.
- Score: 42.11113113574589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models have shown promising performance on the task of translating
natural language questions into SQL queries (Text-to-SQL). However, most of the
state-of-the-art (SOTA) approaches rely on powerful yet closed-source large
language models (LLMs), such as ChatGPT and GPT-4, which may have the
limitations of unclear model architectures, data privacy risks, and expensive
inference overheads. To address the limitations, we introduce CodeS, a series
of pre-trained language models with parameters ranging from 1B to 15B,
specifically designed for the text-to-SQL task. CodeS is a fully open-source
language model, which achieves superior accuracy with much smaller parameter
sizes. This paper studies the research challenges in building CodeS. To enhance
the SQL generation abilities of CodeS, we adopt an incremental pre-training
approach using a specifically curated SQL-centric corpus. Based on this, we
address the challenges of schema linking and rapid domain adaptation through
strategic prompt construction and a bi-directional data augmentation technique.
We conduct comprehensive evaluations on multiple datasets, including the widely
used Spider benchmark, the newly released BIRD benchmark, robustness-diagnostic
benchmarks such as Spider-DK, Spider-Syn, Spider-Realistic, and Dr.Spider, as
well as two real-world datasets created for financial and academic
applications. The experimental results show that our CodeS achieves new SOTA
accuracy and robustness on nearly all challenging text-to-SQL benchmarks.
Related papers
- Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows [64.94146689665628]
Spider 2.0 is an evaluation framework for real-world text-to-sql problems derived from enterprise-level database use cases.
The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake.
We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-levels.
arXiv Detail & Related papers (2024-11-12T12:52:17Z) - Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-s enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposed a novel approach that only needs SQL Quality to enhance Text-to-s performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z) - Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to- tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z) - DFIN-SQL: Integrating Focused Schema with DIN-SQL for Superior Accuracy
in Large-Scale Databases [0.0]
This paper introduces DFIN, an innovative extension of DIN-composed (Decomposed-In-Context)
DFIN enhances Text-to-composed conversion by addressing schema linking errors, which are a major source of inaccuracies.
Our evaluation on the BIRD dataset, a challenging real-world benchmark, demonstrates that DFIN not only efficiently but also improves accuracy, achieving a score of 51.69.
arXiv Detail & Related papers (2024-03-01T07:14:45Z) - Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries [4.141402725050671]
This paper is the first in-depth evaluation of the data model robustness of Text-to-- systems in practice.
It is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022.
All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models.
arXiv Detail & Related papers (2024-02-13T10:28:57Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs)
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems.
It is composed of publicly available text-to-domain datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation [13.196264569882777]
The current mainstream end-to-end Text2 model is not only difficult to build due to its complex structure and high requirements for training data, but also difficult to adjust due to massive parameters.
This paper proposes a pipeline method: SP Experiments to achieve the desired result.
We construct the dataset based on the marketing business data of the State Grid Corporation of China.
arXiv Detail & Related papers (2023-05-10T10:01:36Z) - Towards Generalizable and Robust Text-to-SQL Parsing [77.18724939989647]
We propose a novel TKK framework consisting of Task decomposition, Knowledge acquisition, and Knowledge composition to learn text-to- parsing in stages.
We show that our framework is effective in all scenarios and state-of-the-art performance on the Spider, SParC, and Co. datasets.
arXiv Detail & Related papers (2022-10-23T09:21:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.