AraSpider: Democratizing Arabic-to-SQL
- URL: http://arxiv.org/abs/2402.07448v1
- Date: Mon, 12 Feb 2024 07:11:13 GMT
- Title: AraSpider: Democratizing Arabic-to-SQL
- Authors: Ahmed Heakl, Youssef Mohamed, and Ahmed B. Zaky
- Abstract summary: This study presents AraSpider, the first Arabic version of the Spider dataset, aimed at improving natural language processing (NLP) in the Arabic-speaking community.
- Score: 1.082634245716027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study presents AraSpider, the first Arabic version of the Spider
dataset, aimed at improving natural language processing (NLP) in the
Arabic-speaking community. Four multilingual translation models were tested for
their effectiveness in translating English to Arabic. Additionally, two models
were assessed for their ability to generate SQL queries from Arabic text. The
results showed that using back translation significantly improved the
performance of both ChatGPT 3.5 and SQLCoder models, which are considered top
performers on the Spider dataset. Notably, ChatGPT 3.5 demonstrated
high-quality translation, while SQLCoder excelled in text-to-SQL tasks. The
study underscores the importance of incorporating contextual schema and
employing back translation strategies to enhance model performance in Arabic
NLP tasks. Moreover, the provision of detailed methodologies for
reproducibility and translation of the dataset into other languages highlights
the research's commitment to promoting transparency and collaborative knowledge
sharing in the field. Overall, these contributions advance NLP research,
empower Arabic-speaking researchers, and enrich the global discourse on
language comprehension and database interrogation.
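The two strategies the abstract highlights, back translation and contextual schema, can be illustrated with a minimal sketch. This is one reading of the abstract, not the authors' implementation: it assumes back translation means rendering the Arabic question back into English before SQL generation, and that schema context means listing tables and columns in the prompt. The names `build_prompt`, `arabic_to_sql`, `back_translate`, and `generate_sql` are hypothetical; the two callables stand in for whatever translation and text-to-SQL models are used.

```python
from typing import Callable, Dict, List


def build_prompt(question: str, schema: Dict[str, List[str]]) -> str:
    """Prepend a plain-text description of the schema to the question.

    The prompt layout is an assumption made for illustration only.
    """
    lines = ["Database schema:"]
    for table, columns in schema.items():
        lines.append(f"- {table}({', '.join(columns)})")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


def arabic_to_sql(
    arabic_question: str,
    schema: Dict[str, List[str]],
    back_translate: Callable[[str], str],  # hypothetical Arabic-to-English translator
    generate_sql: Callable[[str], str],    # hypothetical text-to-SQL model call
) -> str:
    """Back-translate the question, add schema context, then generate SQL."""
    english_question = back_translate(arabic_question)
    prompt = build_prompt(english_question, schema)
    return generate_sql(prompt)
```

Passing the models in as callables keeps the sketch independent of any particular translation or text-to-SQL API; in practice each callable would wrap a real model call.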
Related papers
- GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content.
We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality.
Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- ArabicaQA: A Comprehensive Dataset for Arabic Question Answering [13.65056111661002]
We introduce ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic.
We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus.
arXiv Detail & Related papers (2024-03-26T16:37:54Z)
- Ar-Spider: Text-to-SQL in Arabic [11.463438573648297]
This paper introduces Ar-Spider, the first Arabic cross-domain text-to-SQL dataset.
Due to the unique nature of the language, two major challenges have been encountered, namely linguistic and structural challenges.
We propose the context similarity relationship (CSR) approach, which increases overall performance by about 1.52% for S2SQL and 1.06% for LGESQL and narrows the gap between Arabic and English to 7.73%.
arXiv Detail & Related papers (2024-02-22T23:11:17Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces a framework for enhancing Text-to-SQL using large language models (LLMs).
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep into understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)
- MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese).
Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages.
We also propose a simple schema augmentation framework, SAVe (Schema-Augmentation-with-Verification), which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
- mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer [0.0]
A large number of techniques are geared towards the English language.
In this work, we investigated translation to SQL when input questions are given in a language different from English.
We changed the RAT-SQL+GAP system by relying on a multilingual BART model.
arXiv Detail & Related papers (2021-10-07T15:08:24Z)
- AraBERT: Transformer-based Model for Arabic Language Understanding [0.0]
We pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language.
The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks.
arXiv Detail & Related papers (2020-02-28T22:59:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.