BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain
- URL: http://arxiv.org/abs/2406.07860v1
- Date: Wed, 12 Jun 2024 04:22:27 GMT
- Title: BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain
- Authors: Rahul Kumar, Amar Raja Dibbu, Shrutendra Harsola, Vignesh Subrahmaniam, Ashutosh Modi
- Abstract summary: We propose a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL.
The dataset consists of 100k natural language query-SQL pairs and accounting databases of 1 million records.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language query-SQL pairs and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, thus pointing towards developing more focused models for this domain.
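To illustrate the kind of question-SQL pairs such a dataset contains, here is a minimal runnable sketch over a hypothetical accounting schema (the table, columns, and data are invented for illustration and are not BookSQL's actual schema):

```python
import sqlite3

# Hypothetical accounting table; BookSQL's real schema may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        customer TEXT,
        amount REAL,
        txn_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO transactions (customer, amount, txn_date) VALUES (?, ?, ?)",
    [("Acme Corp", 1200.0, "2024-01-15"),
     ("Acme Corp", 800.0, "2024-02-03"),
     ("Globex", 450.0, "2024-01-20")],
)

# One (natural language, SQL) pair of the kind a Text-to-SQL model must produce.
question = "What is the total amount billed to Acme Corp?"
sql = "SELECT SUM(amount) FROM transactions WHERE customer = 'Acme Corp'"
total = conn.execute(sql).fetchone()[0]
print(total)  # 2000.0
```

A Text-to-SQL model is trained to map `question` to `sql`; execution accuracy then compares the result of the predicted query against the gold query's result.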
Related papers
- FinStat2SQL: A Text2SQL Pipeline for Financial Statement Analysis [0.0]
FinStat2SQL is a lightweight text2sql pipeline enabling natural language queries over financial statements.
We build a domain-specific database and evaluate models on a synthetic QA dataset.
A fine-tuned 7B model achieves 61.33% accuracy with sub-4-second response times on consumer hardware.
arXiv Detail & Related papers (2025-06-29T14:55:21Z) - OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale [31.852909145101677]
We propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention.
We introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases.
We develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B.
arXiv Detail & Related papers (2025-03-04T03:30:56Z) - Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation [25.638927795540454]
We introduce the Text-to-NoSQL task, which aims to convert natural language queries into NoSQL queries.
To promote research in this area, we released a large-scale and open-source dataset for this task, named TEND (short for Text-to-NoSQL dataset).
We also designed a SLM (Small Language Model)-assisted and RAG (Retrieval-Augmented Generation)-assisted multi-step framework called SMART, which is specifically designed for Text-to-NoSQL conversion.
arXiv Detail & Related papers (2025-02-16T17:01:48Z) - Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows [64.94146689665628]
Spider 2.0 is an evaluation framework for real-world text-to-SQL problems derived from enterprise-level database use cases.
The databases in Spider 2.0 are sourced from real data applications, often containing over 1,000 columns and stored in local or cloud database systems such as BigQuery and Snowflake.
We show that solving problems in Spider 2.0 frequently requires understanding and searching through database metadata, dialect documentation, and even project-level codebases.
arXiv Detail & Related papers (2024-11-12T12:52:17Z) - Text2SQL is Not Enough: Unifying AI and Databases with TAG [47.45480855418987]
Table-Augmented Generation (TAG) is a paradigm for answering natural language questions over databases.
We develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly.
arXiv Detail & Related papers (2024-08-27T00:50:14Z) - Fine-Tuning Language Models for Context-Specific SQL Query Generation [0.0]
This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language into SQL queries.
We introduce models specialized in generating SQL queries, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects.
Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs (Starcoder Plus, Code-Llama, and Mistral) employing the LoRA technique to optimize for resource constraints.
The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GPT-4.
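The LoRA technique mentioned above freezes the pretrained weights and learns only a low-rank additive update. The underlying parametrization can be sketched with NumPy (this illustrates the math only, not the authors' actual fine-tuning pipeline; all dimensions are illustrative):

```python
import numpy as np

# LoRA parametrization: y = W x + (alpha / r) * B (A x), with W frozen
# and only the low-rank factors A, B trained. B starts at zero, so the
# model's output is unchanged before any training.
rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 8, 16
W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, zero-initialized

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # identical to base model at init

# Trainable parameters drop from d_in * d_out to r * (d_in + d_out).
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```

This is why LoRA suits resource-constrained fine-tuning: only the small factors A and B receive gradients, while W stays fixed.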
arXiv Detail & Related papers (2023-12-04T18:04:27Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Can LLM Already Serve as A Database Interface? A BIg Bench for
Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present BIRD, a big benchmark for large-scale database grounded text-to-SQL tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-SQL model, i.e., ChatGPT, achieves only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
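The retrieve-then-prompt pattern behind exemplar selection can be sketched with a toy similarity search (XRICL learns a dedicated retriever over trained embeddings; the bag-of-words vectors and example pool here are purely illustrative):

```python
from collections import Counter
import math

# Toy exemplar pool; a real system retrieves from a large annotated corpus.
exemplars = [
    "How many singers do we have?",
    "List the name of all departments.",
    "What is the average age of all singers?",
]

def bow(text):
    # Bag-of-words vector; XRICL uses learned dense embeddings instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_exemplar(query):
    # The retrieved exemplar would be placed in the LLM prompt as a
    # demonstration before the target query.
    q = bow(query)
    return max(exemplars, key=lambda e: cosine(q, bow(e)))

print(top_exemplar("what is the average age of singers"))
```

The retrieved exemplar (and, in XRICL, its SQL annotation) is prepended to the prompt so the model can generalize from a relevant English demonstration to the cross-lingual input.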
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-SQL parsing is to convert a natural language (NL) question to its corresponding structured query language (SQL) based on the evidence provided by databases.
Deep neural networks have significantly advanced this task via neural generation models, which automatically learn a mapping function from an input NL question to an output SQL query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - Deep Learning Driven Natural Languages Text to SQL Query Conversion: A
Survey [2.309914459672557]
In this paper, we try to present a holistic overview of 24 recent neural network models studied in the last couple of years.
We also give an overview of 11 datasets that are widely used to train models for TEXT2SQL technologies.
arXiv Detail & Related papers (2022-08-08T20:54:34Z) - KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers [26.15889661083109]
We present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases.
We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers but that a more realistic evaluation setting and creative use of associated database documentation boosts their accuracy by over 13.2%.
arXiv Detail & Related papers (2021-06-22T00:08:03Z) - Structure-Grounded Pretraining for Text-to-SQL [75.19554243393814]
We present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL.
We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder.
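The column-grounding task above can be illustrated as labeling which schema columns a question mentions (a naive exact-match sketch with an invented schema; StruG derives its supervision from aligned text-table corpora rather than string matching):

```python
# Naive column-grounding labels: 1 if every word of the column name
# appears in the question, else 0. Illustrative only; the actual
# pretraining signal comes from weak supervision over aligned data.
def column_grounding_labels(question, columns):
    tokens = set(question.lower().replace("?", "").split())
    labels = {}
    for col in columns:
        col_words = set(col.lower().split("_"))
        labels[col] = int(col_words <= tokens)
    return labels

cols = ["singer_name", "age", "country"]
print(column_grounding_labels("What is the age of each singer from France?", cols))
# {'singer_name': 0, 'age': 1, 'country': 0}
```

Value grounding and column-value mapping extend the same idea to cell values, giving the text-table encoder structure-aware pretraining signals.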
arXiv Detail & Related papers (2020-10-24T04:35:35Z) - Data Agnostic RoBERTa-based Natural Language to SQL Query Generation [0.0]
The NL2SQL task aims at finding deep learning approaches to solve the problem of converting natural language questions into valid SQL queries.
We have presented an approach with data privacy at its core.
Although we have not achieved state-of-the-art results, we have eliminated the need for the table right from the training of the model.
arXiv Detail & Related papers (2020-10-11T13:18:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.