Related papers: LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO

LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO

URL: http://arxiv.org/abs/2506.13785v1
Date: Wed, 11 Jun 2025 04:04:13 GMT
Title: LLM-Driven Data Generation and a Novel Soft Metric for Evaluating Text-to-SQL in Aviation MRO
Authors: Patrick Sutanto, Jonathan Kenrick, Max Lorenz, Joan Santoso,
Abstract summary: We introduce a novel F1-score-based'soft' metric that quantifies the informational overlap between generated and ground-truth results.<n>We demonstrate our contributions through an empirical evaluation on an authentic MRO database.
Score: 0.6374763930914525
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The application of Large Language Models (LLMs) to text-to-SQL tasks promises to democratize data access, particularly in critical industries like aviation Maintenance, Repair, and Operation (MRO). However, progress is hindered by two key challenges: the rigidity of conventional evaluation metrics such as execution accuracy, which offer coarse, binary feedback, and the scarcity of domain-specific evaluation datasets. This paper addresses these gaps. To enable more nuanced assessment, we introduce a novel F1-score-based 'soft' metric that quantifies the informational overlap between generated and ground-truth SQL results. To address data scarcity, we propose an LLM-driven pipeline that synthesizes realistic question-SQL pairs from database schemas. We demonstrate our contributions through an empirical evaluation on an authentic MRO database. Our experiments show that the proposed soft metric provides more insightful performance analysis than strict accuracy, and our data generation technique is effective in creating a domain-specific benchmark. Together, these contributions offer a robust framework for evaluating and advancing text-to-SQL systems in specialized environments.

Related papers

APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL [39.76924093980244]
APEX- verbalize is a framework that shifts the paradigm from passive translation to agentic exploration.<n>Our framework employs a hypothesis-verification loop to ground model reasoning in real data.
arXiv Detail & Related papers (2026-02-11T07:50:47Z)
Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks [21.891522433628893]
Large language models (LLMs) are increasingly powering Text-to- (Text2) systems, enabling non-expert users to query industrial databases using natural language.<n>While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain.<n>This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2 systems.
arXiv Detail & Related papers (2025-10-13T01:29:54Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation [25.638927795540454]
We introduce the Text-to-No task, which aims to convert natural language queries into accessible queries.<n>To promote research in this area, we released a large-scale and open-source dataset for this task, named TEND (short interfaces for Text-to-No dataset)<n>We also designed a SLM (Small Language Model)-assisted and RAG (Retrieval-augmented Generation)-assisted multi-step framework called SMART, which is specifically designed for Text-to-No conversion.
arXiv Detail & Related papers (2025-02-16T17:01:48Z)
Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-s enables non-expert users to effortlessly retrieve desired information from databases using natural language queries. Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD. This paper proposed a novel approach that only needs SQL Quality to enhance Text-to-s performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z)
FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark [8.445403382578167]
This paper introduces FLEX (False-Lesscution EXecution), a novel approach to evaluating text-to-technical systems. Our metric improves agreement with human experts with comprehensive context and sophisticated criteria. This work contributes to a more accurate and nuanced evaluation of text-to-technical systems, potentially reshaping our understanding of state-of-the-art performance in this field.
arXiv Detail & Related papers (2024-09-24T01:40:50Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Anonymizing text that contains sensitive information is crucial for a wide range of applications.<n>Existing techniques face the emerging challenges of the re-identification ability of large language models.<n>We propose a framework composed of three key components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios [51.66718740300016]
TableLLM is a robust large language model (LLM) with 8 billion parameters.<n>TableLLM is purpose-built for proficiently handling data manipulation tasks.<n>We have released the model checkpoint, source code, benchmarks, and a web application for user interaction.
arXiv Detail & Related papers (2024-03-28T11:21:12Z)
Enhancing Text-to-SQL Translation for Financial System Design [5.248014305403357]
We consider Large Language Models (LLMs), which have achieved state of the art for various NLP tasks. We propose two novel metrics that were designed to adequately measure the similarity between relational queries.
arXiv Detail & Related papers (2023-12-22T14:34:19Z)
Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation [76.76046657162306]
Large language models (LLMs) have emerged as a new paradigm for Text-to- task. Large language models (LLMs) have emerged as a new paradigm for Text-to- task.
arXiv Detail & Related papers (2023-08-29T14:59:54Z)
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended) [53.95151604061761]
This paper introduces the framework for enhancing Text-to- filtering using large language models (LLMs) With few-shot prompting, we explore the effectiveness of consistency decoding with execution-based error analyses. With instruction fine-tuning, we delve deep in understanding the critical paradigms that influence the performance of tuned LLMs.
arXiv Detail & Related papers (2023-05-26T21:39:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.