DataGpt-SQL-7B: An Open-Source Language Model for Text-to-SQL
- URL: http://arxiv.org/abs/2409.15985v1
- Date: Tue, 24 Sep 2024 11:38:08 GMT
- Title: DataGpt-SQL-7B: An Open-Source Language Model for Text-to-SQL
- Authors: Lixia Wu, Peng Li, Junhong Lou, Lei Fu,
- Abstract summary: We propose a suite of compact, fine-tuned models and self-refine mechanisms to democratize data access and analysis for non-expert users.
Our system, DataGpt-sql, achieved 87.2% accuracy on the spider-dev.
- Score: 7.76068876576964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In addressing the pivotal role of translating natural language queries into SQL commands, we propose a suite of compact, fine-tuned models and self-refine mechanisms to democratize data access and analysis for non-expert users, mitigating risks associated with closed-source Large Language Models. Specifically, we constructed a dataset of over 20K sample for Text-to-SQL as well as the preference dateset, to improve the efficiency in the domain of SQL generation. To further ensure code validity, a code corrector was integrated into the model. Our system, DataGpt-sql, achieved 87.2\% accuracy on the spider-dev, respectively, showcasing the effectiveness of our solution in text-to-SQL conversion tasks. Our code, data, and models are available at \url{https://github.com/CainiaoTechAi/datagpt-sql-7b}
Related papers
- MSc-SQL: Multi-Sample Critiquing Small Language Models For Text-To-SQL Translation [10.205010004198757]
Text-to-generation enables non-experts to interact with databases via natural language.
Recent advances on large closed-source models like GPT-4 present challenges in accessibility, privacy, and latency.
We focus on developing small, efficient, and open-source text-to-generation models.
arXiv Detail & Related papers (2024-10-16T18:03:24Z) - SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging [30.306023265985658]
We introduce a framework for generating high-quality synthetic training data for any dialect.
We propose a novel Mixture-of-Experts (MoE) that leverages the shared knowledge across dialects.
arXiv Detail & Related papers (2024-08-22T20:50:48Z) - TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring [11.78795632771211]
We introduce a novel benchmark designed to evaluate text-to- reliability as a model's ability to correctly handle any type of input question.
We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches.
arXiv Detail & Related papers (2024-03-23T16:12:52Z) - Fine-Tuning Language Models for Context-Specific SQL Query Generation [0.0]
This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language intosql queries.
We introduce models specialized in generatingsql queries, trained on synthetic datasets tailored to the Snowflake SQL and Google dialects.
Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs(Starcoder Plus, Code-Llama, and Mistral) employing the LoRa technique to optimize for resource constraints.
The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GP
arXiv Detail & Related papers (2023-12-04T18:04:27Z) - SQLPrompt: In-Context Text-to-SQL with Minimal Labeled Data [54.69489315952524]
"Prompt" is designed to improve the few-shot prompting capabilities of Text-to-LLMs.
"Prompt" outperforms previous approaches for in-context learning with few labeled data by a large margin.
We show that emphPrompt outperforms previous approaches for in-context learning with few labeled data by a large margin.
arXiv Detail & Related papers (2023-11-06T05:24:06Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs)
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems.
It is composed of publicly available text-to-domain datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-corpora parsing is to convert a natural language (NL) question to its corresponding structured query language () based on the evidences provided by databases.
Deep neural networks have significantly advanced this task by neural generation models, which automatically learn a mapping function from an input NL question to an output query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - S$^2$SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder
for Text-to-SQL Parsers [66.78665327694625]
We propose S$2$, injecting Syntax to question- encoder graph for Text-to- relational parsing.
We also employ the decoupling constraint to induce diverse edge embedding, which further improves the network's performance.
Experiments on the Spider and robustness setting Spider-Syn demonstrate that the proposed approach outperforms all existing methods when pre-training models are used.
arXiv Detail & Related papers (2022-03-14T09:49:15Z) - Data Agnostic RoBERTa-based Natural Language to SQL Query Generation [0.0]
The NL2 task aims at finding deep learning approaches to solve the problem converting by natural language questions into valid queries.
We have presented an approach with data privacy at its core.
Although we have not achieved state of the art results, we have eliminated the need for the table right from the training of the model.
arXiv Detail & Related papers (2020-10-11T13:18:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.