Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
- URL: http://arxiv.org/abs/2501.11498v1
- Date: Mon, 20 Jan 2025 14:06:40 GMT
- Title: Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
- Authors: Salmane Chafik, Saad Ezzini, Ismail Berrada,
- Abstract summary: This work introduces the first large-scale, cross-domain text-to-IDER- dataset in an Arabic dialect.
It consists of 9,428 NLQ- pairs across 69 databases in various domains.
The dataset also incorporates the complexities of the Moroccan dialect, which is known for its source languages.
- Score: 5.762345156477737
- License:
- Abstract: The task of converting natural language questions (NLQs) into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source languages, numerous borrowed words, and unique expressions. This demonstrates that our dataset will be a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.
Related papers
- Ar-Spider: Text-to-SQL in Arabic [11.463438573648297]
This paper introduces Ar-Spider 1, the first Arabic cross-language text-to-domain dataset.
Due to the unique nature of the language, two major challenges have been encountered, namely linguistic and structural challenges.
We propose the similarity relationship (CSR) approach, which results in a significant increase in the overall performance of about 1.52% for S2 and 1.06% for LGE and closes the gap between Arabic and English languages to 7.73%.
arXiv Detail & Related papers (2024-02-22T23:11:17Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems.
It is composed of publicly available text-to-domain datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Can LLM Already Serve as A Database Interface? A BIg Bench for
Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present Bird, a big benchmark for large-scale database grounded in text-to-efficient tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-efficient models, i.e. ChatGPT, achieves only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z) - MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic
Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to- schema- dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese)
Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages.
We also propose a simple framework augmentation framework SAVe (Augmentation-with-Verification) which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-corpora parsing is to convert a natural language (NL) question to its corresponding structured query language () based on the evidences provided by databases.
Deep neural networks have significantly advanced this task by neural generation models, which automatically learn a mapping function from an input NL question to an output query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - Deep Learning Driven Natural Languages Text to SQL Query Conversion: A
Survey [2.309914459672557]
In this paper, we try to present a holistic overview of 24 recent neural network models studied in the last couple of years.
We also give an overview of 11 datasets that are widely used to train models for TEXT2 technologies.
arXiv Detail & Related papers (2022-08-08T20:54:34Z) - "What Do You Mean by That?" A Parser-Independent Interactive Approach
for Enhancing Text-to-SQL [49.85635994436742]
We include human in the loop and present a novel-independent interactive approach (PIIA) that interacts with users using multi-choice questions.
PIIA is capable of enhancing the text-to-domain performance with limited interaction turns by using both simulation and human evaluation.
arXiv Detail & Related papers (2020-11-09T02:14:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.