Data Transformation to Construct a Dataset for Generating
Entity-Relationship Model from Natural Language
- URL: http://arxiv.org/abs/2312.13694v1
- Date: Thu, 21 Dec 2023 09:45:13 GMT
- Title: Data Transformation to Construct a Dataset for Generating
Entity-Relationship Model from Natural Language
- Authors: Zhenwen Li, Jian-Guang Lou, Tao Xie
- Abstract summary: In order to reduce the manual cost of designing ER models, recent approaches have been proposed to address the task of NL2ERM.
These approaches are typically rule-based ones that rely on rigid heuristic rules.
Despite having better generalization capability than rule-based approaches, deep-learning-based models are lacking for NL2ERM due to the lack of a large-scale dataset.
- Score: 39.53954130028595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In order to reduce the manual cost of designing ER models, recent approaches
have been proposed to address the task of NL2ERM, i.e., automatically
generating entity-relationship (ER) models from natural language (NL)
utterances such as software requirements. These approaches are typically
rule-based ones that rely on rigid heuristic rules and thus cannot generalize
well to the various linguistic ways of describing the same requirement.
Despite having better generalization capability than rule-based approaches,
deep-learning-based models are lacking for NL2ERM due to the lack of a
large-scale dataset. To address this issue, in this paper, we report our insight that there
exists a high similarity between the task of NL2ERM and the increasingly
popular task of text-to-SQL, and propose a data transformation algorithm that
transforms the existing data of text-to-SQL into the data of NL2ERM. We apply
our data transformation algorithm on Spider, one of the most popular
text-to-SQL datasets, and we also collect some data entries with different NL
types, to obtain a large-scale NL2ERM dataset. Because NL2ERM can be seen as a
special information extraction (IE) task, we train two state-of-the-art IE
models on our dataset. The experimental results show that both models
achieve high performance and outperform existing baselines.
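The abstract names the transformation only at a high level. As an illustration, the sketch below shows how a simplified Spider-style relational schema could be mapped onto ER-model annotations; the field names loosely follow Spider's released JSON format, but the tables-to-entities and foreign-keys-to-relationships heuristic is an illustrative assumption, not the authors' exact algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class ERModel:
    entities: dict = field(default_factory=dict)       # entity name -> attribute list
    relationships: list = field(default_factory=list)  # (entity_a, entity_b) pairs

def schema_to_er(schema):
    """Map a simplified Spider-style schema onto an ER model.

    Assumed schema format:
      {"table_names": [...],
       "column_names": [(table_idx, column_name), ...],
       "foreign_keys": [(col_idx_a, col_idx_b), ...]}
    """
    er = ERModel()
    # Heuristic 1 (assumption): every table becomes an entity whose
    # attributes are its columns.
    for t_idx, table in enumerate(schema["table_names"]):
        er.entities[table] = [col for owner, col in schema["column_names"]
                              if owner == t_idx]
    # Heuristic 2 (assumption): every foreign key becomes a binary
    # relationship between the two tables it connects.
    for col_a, col_b in schema["foreign_keys"]:
        owner_a = schema["column_names"][col_a][0]
        owner_b = schema["column_names"][col_b][0]
        er.relationships.append((schema["table_names"][owner_a],
                                 schema["table_names"][owner_b]))
    return er

toy_schema = {
    "table_names": ["student", "course", "enrollment"],
    "column_names": [(0, "id"), (0, "name"), (1, "id"), (1, "title"),
                     (2, "student_id"), (2, "course_id")],
    "foreign_keys": [(4, 0), (5, 2)],  # enrollment.student_id -> student.id, etc.
}
print(schema_to_er(toy_schema))
```

In the real dataset, the NL utterance would additionally be aligned to these entities and relationships, which is what lets the authors treat NL2ERM as a special information extraction (IE) task.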
Related papers
- A Survey of NL2SQL with Large Language Models: Where are we, and where are we going? [32.84561352339466]
We provide a review of NL2SQL techniques powered by Large Language Models (LLMs).
We discuss the research challenges and open problems of NL2SQL in the LLM era.
arXiv Detail & Related papers (2024-08-09T14:59:36Z)
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
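A minimal sketch of the data-assembly idea described above, where `strong_generate`, `weak_generate`, and `executes_correctly` are hypothetical stand-ins, and the (chosen, rejected) preference-pair framing is an assumed way to exploit the weak model's error information rather than the paper's exact pipeline:

```python
# Sketch (assumptions noted above): combine strong-model synthetic data
# with weak-model error data for text-to-SQL training.
def build_training_data(questions, strong_generate, weak_generate,
                        executes_correctly, n_samples=4):
    sft_examples, preference_pairs = [], []
    for q in questions:
        # Strong data: trust the larger model's SQL for supervised tuning.
        sft_examples.append({"question": q, "sql": strong_generate(q)})
        # Weak data: sample the smaller model and split its outputs by an
        # execution check to obtain error information.
        samples = [weak_generate(q) for _ in range(n_samples)]
        good = [s for s in samples if executes_correctly(q, s)]
        bad = [s for s in samples if not executes_correctly(q, s)]
        if good and bad:
            preference_pairs.append({"question": q,
                                     "chosen": good[0],
                                     "rejected": bad[0]})
    return sft_examples, preference_pairs
```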
arXiv Detail & Related papers (2024-08-06T15:40:32Z)
- Automated Data Visualization from Natural Language via Large Language Models: An Exploratory Study [41.84915013818794]
The Natural Language to Visualization (NL2Vis) task aims to transform natural-language descriptions into visual representations for a grounded table.
Many deep learning-based approaches have been developed for NL2Vis, but challenges persist in visualizing data sourced from unseen databases or spanning multiple tables.
Taking inspiration from the remarkable generation capabilities of Large Language Models (LLMs), this paper conducts an empirical study to evaluate their potential in generating visualizations.
arXiv Detail & Related papers (2024-04-26T03:25:35Z)
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
- SPSQL: Step-by-step Parsing Based Framework for Text-to-SQL Generation [13.196264569882777]
The current mainstream end-to-end Text2SQL model is not only difficult to build, due to its complex structure and high requirements for training data, but also difficult to adjust, due to its massive number of parameters.
This paper proposes a pipeline method, SPSQL, to achieve the desired result.
We construct the dataset based on the marketing business data of the State Grid Corporation of China.
arXiv Detail & Related papers (2023-05-10T10:01:36Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
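The summary leaves the uncertainty measure unspecified; below is a minimal sketch, assuming mean negative log-likelihood under the generation model as the score, with a hypothetical `token_logprobs` hook into that model:

```python
# Sketch: from a pool of pseudo-labeled examples, keep the k the model is
# least confident about. `token_logprobs(src, hyp)` is a hypothetical hook
# returning the model's per-token log-probabilities for hypothesis `hyp`.
def select_most_uncertain(pool, token_logprobs, k):
    def uncertainty(example):
        lps = token_logprobs(example["src"], example["hyp"])
        return -sum(lps) / max(len(lps), 1)  # mean negative log-likelihood
    return sorted(pool, key=uncertainty, reverse=True)[:k]
```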
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP MODEL obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
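To make the joint text-table input concrete, here is a minimal sketch of a TaBERT-style linearization, in which each cell of a selected row is rendered as a `column | type | value` string and concatenated with the utterance; the single-row snapshot and the plain [CLS]/[SEP] markers are simplifying assumptions:

```python
# Sketch: linearize one table row TaBERT-style and pair it with an NL
# utterance. Real TaBERT selects a multi-row "content snapshot" and uses
# the tokenizer's own special tokens; this is a simplified illustration.
def linearize_row(headers, types, row):
    cells = [f"{h} | {t} | {v}" for h, t, v in zip(headers, types, row)]
    return " [SEP] ".join(cells)

def build_input(utterance, headers, types, row):
    return f"[CLS] {utterance} [SEP] {linearize_row(headers, types, row)}"

print(build_input("Which city hosted the 2008 games?",
                  ["year", "city"], ["number", "text"],
                  ["2008", "Beijing"]))
```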