Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation
with Semantic Fidelity
- URL: http://arxiv.org/abs/2004.06577v2
- Date: Tue, 10 Nov 2020 19:09:46 GMT
- Title: Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation
with Semantic Fidelity
- Authors: Hamza Harkous, Isabel Groves, Amir Saffari
- Abstract summary: We present DataTuner, a neural, end-to-end data-to-text generation system.
We take a two-stage generation-reranking approach, combining a fine-tuned language model with a semantic fidelity classifier.
We show that DataTuner achieves state-of-the-art results on automated metrics across four major D2T datasets.
- Score: 3.8673630752805432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end neural data-to-text (D2T) generation has recently emerged as an
alternative to pipeline-based architectures. However, it has faced challenges
in generalizing to new domains and generating semantically consistent text. In
this work, we present DataTuner, a neural, end-to-end data-to-text generation
system that makes minimal assumptions about the data representation and the
target domain. We take a two-stage generation-reranking approach, combining a
fine-tuned language model with a semantic fidelity classifier. Each of our
components is learnt end-to-end without the need for dataset-specific
heuristics, entity delexicalization, or post-processing. We show that DataTuner
achieves state-of-the-art results on the automated metrics across four major
D2T datasets (LDC2017T10, WebNLG, ViGGO, and Cleaned E2E), with fluency
assessed by human annotators nearing or exceeding the human-written reference
texts. We further demonstrate that the model-based semantic fidelity scorer in
DataTuner is a better assessment tool compared to traditional, heuristic-based
measures. Our generated text has significantly better semantic fidelity than
the state of the art across all four datasets.
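The two-stage pattern described in the abstract (sample candidate texts from a fine-tuned language model, then rerank them with a semantic fidelity classifier) can be sketched as follows. This is a minimal illustration using Hugging Face transformers, not DataTuner's released code: the "gpt2" generator, the "roberta-base" classifier, and the assumption that label 1 means "semantically faithful" are all stand-ins.

```python
# Minimal generate-then-rerank sketch in the spirit of the abstract; this is
# NOT DataTuner's released code. "gpt2" and "roberta-base" are stand-ins for
# the fine-tuned generator and the semantic fidelity classifier, and the
# classifier head below would first need training on faithful/unfaithful pairs.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen_lm = AutoModelForCausalLM.from_pretrained("gpt2")
clf_tok = AutoTokenizer.from_pretrained("roberta-base")
clf = AutoModelForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=2)

def generate_with_fidelity(linearized_data: str, num_candidates: int = 8) -> str:
    # Stage 1: sample several candidate texts from the (fine-tuned) LM.
    inputs = gen_tok(linearized_data, return_tensors="pt")
    outputs = gen_lm.generate(**inputs, do_sample=True, top_p=0.9,
                              max_new_tokens=64,
                              num_return_sequences=num_candidates,
                              pad_token_id=gen_tok.eos_token_id)
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [gen_tok.decode(o[prompt_len:], skip_special_tokens=True)
                  for o in outputs]
    # Stage 2: rerank by the classifier's probability that the (data, text)
    # pair is semantically faithful; label 1 = "faithful" is an assumption.
    pairs = clf_tok([linearized_data] * len(candidates), candidates,
                    return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = clf(**pairs).logits.softmax(-1)[:, 1]
    return candidates[int(scores.argmax())]

print(generate_with_fidelity("name[The Eagle] | eatType[coffee shop] | area[city centre]"))
```

The reranking stage is what lets the system favor faithful candidates without the dataset-specific heuristics, delexicalization, or post-processing that the abstract explicitly avoids.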
Related papers
- VTechAGP: An Academic-to-General-Audience Text Paraphrase Dataset and Benchmark Models [5.713983191152314]
VTechAGP is the first academic-to-general-audience text paraphrase dataset.
We also propose a novel dynamic soft prompt generative language model, DSPT5.
For training, we leverage a contrastive-generative loss function to learn the keywords in the dynamic prompt.
arXiv Detail & Related papers (2024-11-07T16:06:00Z)
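The DSPT5 entry above mentions a contrastive-generative training loss. As a rough, generic sketch (the summary does not spell out DSPT5's exact formulation), one common way to combine the two objectives is a weighted sum of a token-level cross-entropy term and an InfoNCE contrastive term; the weighting `alpha`, the temperature, and the embedding inputs below are illustrative assumptions.

```python
# Generic sketch of a combined contrastive-generative objective; the summary
# above does not spell out DSPT5's exact formulation, so the InfoNCE form,
# the weighting alpha, and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_generative_loss(gen_logits, target_ids, anchor, positive,
                                negatives, temperature=0.07, alpha=0.5):
    # Generative term: token-level cross-entropy over the decoder output.
    # gen_logits: (seq_len, vocab), target_ids: (seq_len,)
    gen_loss = F.cross_entropy(gen_logits.reshape(-1, gen_logits.size(-1)),
                               target_ids.reshape(-1), ignore_index=-100)
    # Contrastive term (InfoNCE): pull the anchor embedding (e.g. a prompt or
    # keyword representation) toward the positive, away from the negatives.
    # anchor/positive: (dim,), negatives: (n_neg, dim)
    anchor = F.normalize(anchor, dim=-1)
    cands = F.normalize(torch.cat([positive.unsqueeze(0), negatives]), dim=-1)
    sims = (cands @ anchor / temperature).unsqueeze(0)  # (1, 1 + n_neg)
    ctr_loss = F.cross_entropy(sims, torch.zeros(1, dtype=torch.long))
    return alpha * gen_loss + (1 - alpha) * ctr_loss
```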
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but existing approaches do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model [4.474834288759608]
Graph-to-Text (G2T) generation involves verbalizing structured graphs into natural language.
The scarcity of high-quality, general-domain G2T datasets restricts progress in general-domain G2T research.
We introduce Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method.
arXiv Detail & Related papers (2024-09-11T08:16:20Z)
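G2T systems like the one in the WikiOFGraph entry above typically feed the model a linearized form of the graph. Below is a minimal sketch of that standard preprocessing step; the `<S>/<P>/<O>` tag scheme is illustrative, not the dataset's exact input format.

```python
# Standard G2T preprocessing: flatten (subject, predicate, object) triples
# into a tagged sequence that a seq2seq model can verbalize. The <S>/<P>/<O>
# tag scheme is illustrative, not WikiOFGraph's exact format.
def linearize_triples(triples):
    return " ".join(f"<S> {s} <P> {p} <O> {o}" for s, p, o in triples)

triples = [("Alan_Turing", "birthPlace", "Maida_Vale"),
           ("Alan_Turing", "field", "Computer_Science")]
print(linearize_triples(triples))
# <S> Alan_Turing <P> birthPlace <O> Maida_Vale <S> Alan_Turing <P> field <O> Computer_Science
```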
- Triples-to-isiXhosa (T2X): Addressing the Challenges of Low-Resource Agglutinative Data-to-Text Generation [9.80836683456026]
We tackle data-to-text generation for isiXhosa, a low-resource, agglutinative language.
We introduce Triples-to-isiXhosa (T2X), a new dataset based on a subset of WebNLG.
We develop an evaluation framework for T2X that measures how accurately generated text describes the data.
arXiv Detail & Related papers (2024-03-12T11:53:27Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficient annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model that models the composition of programs and maps each program to an utterance.
Thanks to the simplicity of the PCFG and the use of pre-trained BART, our generative model can be learned efficiently from the data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard GeoQuery and Spider benchmarks.
arXiv Detail & Related papers (2021-04-12T21:24:02Z)
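The synthesis recipe in the semantic parsing entry above (sample programs from a PCFG, then verbalize each program with a seq2seq model such as BART) can be illustrated with a toy grammar. The SQL-like rules and weights below are invented for illustration; the paper's grammar is derived from data.

```python
# Toy sketch of the "sample programs from a PCFG, then verbalize" recipe:
# a hand-written probabilistic grammar over SQL-like programs. The rules
# and weights are illustrative, not the paper's induced grammar.
import random

PCFG = {
    "QUERY": [(["SELECT", "COL", "FROM", "TABLE", "COND"], 1.0)],
    "COND":  [(["WHERE", "COL", "=", "VAL"], 0.6), ([], 0.4)],
    "COL":   [(["city"], 0.5), (["population"], 0.5)],
    "TABLE": [(["geo"], 1.0)],
    "VAL":   [(["'texas'"], 1.0)],
}

def sample(symbol="QUERY"):
    if symbol not in PCFG:  # terminal symbol: emit it as-is
        return [symbol]
    rules, weights = zip(*PCFG[symbol])
    rhs = random.choices(rules, weights=weights)[0]
    return [tok for sym in rhs for tok in sample(sym)]

program = " ".join(sample())
print(program)  # e.g. SELECT city FROM geo WHERE population = 'texas'
# A seq2seq model (the paper uses pre-trained BART) would then map each
# sampled program to a natural-language utterance, yielding synthetic pairs.
```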
- Neural Data-to-Text Generation with LM-based Text Augmentation [27.822282190362856]
We show that a weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% of the annotations.
By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points.
arXiv Detail & Related papers (2021-02-06T10:21:48Z)
- Data-to-Text Generation with Iterative Text Editing [3.42658286826597]
We present a novel approach to data-to-text generation based on iterative text editing.
We first transform data items to text using trivial templates, and then we iteratively improve the resulting text by a neural model trained for the sentence fusion task.
The output of the model is filtered by a simple heuristic and reranked with an off-the-shelf pre-trained language model.
arXiv Detail & Related papers (2020-11-03T13:32:38Z)
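Reranking with an off-the-shelf pre-trained language model, as in the iterative-editing entry above, is commonly done by scoring each candidate's perplexity and keeping the most fluent one. A minimal sketch with GPT-2 follows; the model choice and the omission of the paper's filtering heuristic are simplifications.

```python
# Hedged sketch of LM-based reranking: score each candidate sentence by
# GPT-2 perplexity and keep the most fluent one. Illustrative only; the
# paper's filtering heuristic is not reproduced here.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token negative log-likelihood
    return float(torch.exp(loss))

candidates = ["The Eagle is a coffee shop in the city centre.",
              "The Eagle coffee shop city centre is."]
print(min(candidates, key=perplexity))  # the lower-perplexity candidate wins
```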
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
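Dynamic Blocking, named in the paraphrasing entry above, discourages the decoder from copying the source verbatim by blocking tokens that would extend a copied source span. The following is a simplified, deterministic sketch of that idea; the published algorithm blocks sampled token pairs probabilistically.

```python
# Simplified sketch of Dynamic Blocking's core idea: if the last generated
# token also appears in the source, penalize the token that follows it in
# the source, so the decoder cannot keep copying the source verbatim.
# (The published algorithm blocks sampled token pairs with some probability;
# this deterministic variant is for illustration only.)
def block_source_bigrams(source_ids, generated_ids, logits, penalty=-1e9):
    """Down-weight continuations that would extend a copied source bigram."""
    if not generated_ids:
        return logits
    last = generated_ids[-1]
    for i, tok in enumerate(source_ids[:-1]):
        if tok == last:
            logits[source_ids[i + 1]] = penalty  # block the source successor
    return logits

# Toy usage with a 6-token vocabulary: after generating token 3 (which is
# followed by token 4 in the source), token 4 is blocked at the next step.
source = [3, 4, 5]
logits = [0.0] * 6
print(block_source_bigrams(source, [3], logits))  # index 4 becomes -1e9
```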
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.