Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation
- URL: http://arxiv.org/abs/2509.09684v1
- Date: Mon, 18 Aug 2025 01:25:41 GMT
- Title: Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation
- Authors: Bruno Yui Yamate, Thais Rodrigues Neubauer, Marcelo Fantinato, Sarajane Marques Peres,
- Abstract summary: This paper introduces text-2--4-PM, a benchmark dataset for the text-to-four task in the process mining domain.<n>The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205sql statements, and ten qualifiers.<n>The results show that text-2--4-PM supports evaluation of text-to-four implementations, offering broader applicability for semantic parsing and other natural language processing tasks.
- Score: 0.10499611180329804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces text-2-SQL-4-PM, a bilingual (Portuguese-English) benchmark dataset designed for the text-to-SQL task in the process mining domain. Text-to-SQL conversion facilitates natural language querying of databases, increasing accessibility for users without SQL expertise and productivity for those that are experts. The text-2-SQL-4-PM dataset is customized to address the unique challenges of process mining, including specialized vocabularies and single-table relational structures derived from event logs. The dataset comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers. Methods include manual curation by experts, professional translations, and a detailed annotation process to enable nuanced analyses of task complexity. Additionally, a baseline study using GPT-3.5 Turbo demonstrates the feasibility and utility of the dataset for text-to-SQL applications. The results show that text-2-SQL-4-PM supports evaluation of text-to-SQL implementations, offering broader applicability for semantic parsing and other natural language processing tasks.
Related papers
- Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation [25.638927795540454]
We introduce the Text-to-No task, which aims to convert natural language queries into accessible queries.<n>To promote research in this area, we released a large-scale and open-source dataset for this task, named TEND (short interfaces for Text-to-No dataset)<n>We also designed a SLM (Small Language Model)-assisted and RAG (Retrieval-augmented Generation)-assisted multi-step framework called SMART, which is specifically designed for Text-to-No conversion.
arXiv Detail & Related papers (2025-02-16T17:01:48Z) - A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? [32.84561352339466]
We provide a review of Text-to- translation techniques powered by Large Language Models (LLMs)<n>We discuss the research challenges and open problems of Text-to- evaluation in the LLMs era.
arXiv Detail & Related papers (2024-08-09T14:59:36Z) - A Survey on Employing Large Language Models for Text-to-SQL Tasks [9.527891544418805]
We present an overall analysis of the methods and various models evaluated on well-known datasets.<n>We discuss the challenges and future directions in this field.
arXiv Detail & Related papers (2024-07-21T14:48:23Z) - Enhancing Text-to-SQL Translation for Financial System Design [5.248014305403357]
We consider Large Language Models (LLMs), which have achieved state of the art for various NLP tasks.
We propose two novel metrics that were designed to adequately measure the similarity between relational queries.
arXiv Detail & Related papers (2023-12-22T14:34:19Z) - SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs)
With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses.
With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-domain systems.
It is composed of publicly available text-to-domain datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Can LLM Already Serve as A Database Interface? A BIg Bench for
Large-Scale Database Grounded Text-to-SQLs [89.68522473384522]
We present Bird, a big benchmark for large-scale database grounded in text-to-efficient tasks.
Our emphasis on database values highlights the new challenges of dirty database contents.
Even the most effective text-to-efficient models, i.e. ChatGPT, achieves only 40.08% in execution accuracy.
arXiv Detail & Related papers (2023-05-04T19:02:29Z) - A Survey on Text-to-SQL Parsing: Concepts, Methods, and Future
Directions [102.8606542189429]
The goal of text-to-corpora parsing is to convert a natural language (NL) question to its corresponding structured query language () based on the evidences provided by databases.
Deep neural networks have significantly advanced this task by neural generation models, which automatically learn a mapping function from an input NL question to an output query.
arXiv Detail & Related papers (2022-08-29T14:24:13Z) - SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset [39.78074639729293]
CHASE contains 2,003 sessions manually constructed from scratch (CHASE-C) and 3,456 sessions translated from English (CHASE-T)
We find the two parts are highly discrepant and incompatible as training and evaluation data.
In this work, we present Se, yet another large-scale session-level text-to- parsing dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch.
arXiv Detail & Related papers (2022-08-26T15:11:10Z) - Weakly Supervised Text-to-SQL Parsing through Question Decomposition [53.22128541030441]
We take advantage of the recently proposed question meaning representation called QDMR.
Given questions, their QDMR structures (annotated by non-experts or automatically predicted) and the answers, we are able to automatically synthesizesql queries.
Our results show that the weakly supervised models perform competitively with those trained on NL- benchmark data.
arXiv Detail & Related papers (2021-12-12T20:02:42Z) - "What Do You Mean by That?" A Parser-Independent Interactive Approach
for Enhancing Text-to-SQL [49.85635994436742]
We include human in the loop and present a novel-independent interactive approach (PIIA) that interacts with users using multi-choice questions.
PIIA is capable of enhancing the text-to-domain performance with limited interaction turns by using both simulation and human evaluation.
arXiv Detail & Related papers (2020-11-09T02:14:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.