Database Entity Recognition with Data Augmentation and Deep Learning
- URL: http://arxiv.org/abs/2508.19372v1
- Date: Tue, 26 Aug 2025 19:05:40 GMT
- Authors: Zikun Fu, Chen Yang, Kourosh Davoudi, Ken Q. Pu
- Abstract summary: We present a human-annotated benchmark for the DB-ER task, derived from popular text-to-SQL benchmarks. We also present a novel data augmentation procedure that leverages automatic annotation of NLQs.
- Score: 2.7396543860784788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the challenge of Database Entity Recognition (DB-ER) in Natural Language Queries (NLQs). We present several key contributions to advance this field: (1) a human-annotated benchmark for the DB-ER task, derived from popular text-to-SQL benchmarks; (2) a novel data augmentation procedure that leverages automatic annotation of NLQs based on the corresponding SQL queries, which are available in popular text-to-SQL benchmarks; and (3) a specialized language-model-based entity recognition model that uses T5 as a backbone and two downstream DB-ER tasks, sequence tagging and token classification, for fine-tuning of the backend and performing DB-ER, respectively. We compared our DB-ER tagger with two state-of-the-art NER taggers and observed better performance in both precision and recall for our model. The ablation evaluation shows that data augmentation boosts precision and recall by over 10%, while fine-tuning of the T5 backbone boosts these metrics by 5-10%.
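The idea of annotating an NLQ automatically from its paired SQL query can be illustrated with a rough sketch. This is not the authors' procedure, which the abstract does not describe in detail; it is a minimal heuristic under stated assumptions. Quoted SQL literals are treated as value mentions, non-keyword identifiers as schema mentions, and NLQ tokens are matched by simple lowercase comparison. The function names, the `SCHEMA`/`VALUE`/`O` labels, and the keyword list are all hypothetical.

```python
import re

# Small, incomplete keyword list (an assumption for this sketch).
SQL_KEYWORDS = {
    "select", "from", "where", "and", "or", "group", "by", "order",
    "having", "limit", "join", "on", "as", "count", "avg", "sum",
    "min", "max", "distinct", "desc", "asc", "in", "not", "like",
}


def sql_entities(sql):
    """Collect candidate schema names and literal values from a SQL string."""
    # Single-quoted literals are treated as value mentions.
    values = {v.lower() for v in re.findall(r"'([^']*)'", sql)}
    # Blank out the literals, then keep identifiers that are not keywords.
    stripped = re.sub(r"'[^']*'", " ", sql)
    names = {
        t.lower()
        for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", stripped)
        if t.lower() not in SQL_KEYWORDS
    }
    return names, values


def annotate(nlq, sql):
    """Tag each NLQ token as SCHEMA, VALUE, or O (hypothetical label set)."""
    names, values = sql_entities(sql)
    # Split identifiers such as "singer_name" into individual words.
    name_words = {w for n in names for w in n.split("_") if w}
    value_words = {w for v in values for w in v.split() if w}
    tags = []
    for token in re.findall(r"\w+", nlq):
        word = token.lower()
        if word in value_words:
            tags.append((token, "VALUE"))
        # Crude plural handling: also try the token without trailing "s".
        elif word in name_words or word.rstrip("s") in name_words:
            tags.append((token, "SCHEMA"))
        else:
            tags.append((token, "O"))
    return tags


if __name__ == "__main__":
    question = "What are the names of singers from France?"
    query = "SELECT name FROM singer WHERE country = 'France'"
    for token, tag in annotate(question, query):
        print(f"{token}\t{tag}")
```

Exact string matching of this kind misses paraphrases and synonyms, which is one reason a human-annotated benchmark remains useful for evaluation alongside automatically labeled training data.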
Related papers
- Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation [54.53145282349042]
We introduce DSR-SQL, a Dual-State Reasoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on the BIRD development set.
arXiv Detail & Related papers (2025-11-26T13:52:50Z) - SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes [10.672822970425404]
We propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations.
arXiv Detail & Related papers (2025-06-08T18:01:26Z) - RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases [34.357399264742526]
We introduce RDB2G-Bench, the first benchmark framework for evaluating graph modeling methods. We benchmark 10 automatic RDB-to-graph modeling methods on 12 tasks about 380x faster than on-the-fly evaluation. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness.
arXiv Detail & Related papers (2025-06-02T06:34:10Z) - Enhancing LLM Fine-tuning for Text-to-SQLs by SQL Quality Measurement [1.392448435105643]
Text-to-SQL enables non-expert users to effortlessly retrieve desired information from databases using natural language queries.
Current state-of-the-art (SOTA) models like GPT4 and T5 have shown impressive performance on large-scale benchmarks like BIRD.
This paper proposes a novel approach that only needs SQL Quality to enhance Text-to-SQL performance.
arXiv Detail & Related papers (2024-10-02T17:21:51Z) - 4DBInfer: A 4D Benchmarking Toolbox for Graph-Centric Predictive Modeling on Relational DBs [67.47600679176963]
RDBs store vast amounts of rich, informative data spread across interconnected tables.
The progress of predictive machine learning models falls behind advances in other domains such as computer vision or natural language processing.
We explore a class of baseline models predicated on converting multi-table datasets into graphs.
We assemble (i) a diverse collection of large-scale RDB datasets and (ii) coincident predictive tasks.
arXiv Detail & Related papers (2024-04-28T15:04:54Z) - UNITE: A Unified Benchmark for Text-to-SQL Evaluation [72.72040379293718]
We introduce a UNIfied benchmark for Text-to-SQL systems.
It is composed of publicly available text-to-SQL datasets and 29K databases.
Compared to the widely used Spider benchmark, we introduce a threefold increase in SQL patterns.
arXiv Detail & Related papers (2023-05-25T17:19:52Z) - Graphix-T5: Mixing Pre-Trained Transformers with Graph-Aware Layers for
Text-to-SQL Parsing [56.232873134174056]
One of the major challenges in text-to-SQL parsing is domain generalization, i.e., how to generalize well to unseen databases.
In this work, we explore ways to further augment the pre-trained text-to-text transformer model with specialized components for text-to-SQL parsing.
To this end, we propose a new architecture, GRAPHIX-T5, augmented with specially designed graph-aware layers.
arXiv Detail & Related papers (2023-01-18T13:29:05Z) - Importance of Synthesizing High-quality Data for Text-to-SQL Parsing [71.02856634369174]
State-of-the-art text-to-SQL algorithms did not further improve on popular benchmarks when trained with augmented synthetic data.
We propose a novel framework that incorporates key relationships from the schema, imposes strong typing, and uses schema-weighted column sampling.
arXiv Detail & Related papers (2022-12-17T02:53:21Z) - N-Best Hypotheses Reranking for Text-To-SQL Systems [6.966624873109535]
The Text-to-SQL task maps natural language utterances to structured queries.
State-of-the-art (SOTA) systems rely on finetuning large, pre-trained language models.
Findings show significant potential improvements with reranking.
arXiv Detail & Related papers (2022-10-19T15:35:06Z) - Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing [110.97778888305506]
BRIDGE represents the question and DB schema in a tagged sequence where a subset of the fields are augmented with cell values mentioned in the question.
BRIDGE attained state-of-the-art performance on popular cross-DB text-to-SQL benchmarks.
Our analysis shows that BRIDGE effectively captures the desired cross-modal dependencies and has the potential to generalize to more text-DB related tasks.
arXiv Detail & Related papers (2020-12-23T12:33:52Z)