DataWords: Getting Contrarian with Text, Structured Data and
Explanations
- URL: http://arxiv.org/abs/2111.05384v1
- Date: Tue, 9 Nov 2021 19:52:13 GMT
- Title: DataWords: Getting Contrarian with Text, Structured Data and
Explanations
- Authors: Stephen I. Gallant and Mirza Nasir Hossain
- Abstract summary: We represent structured data by text sentences, DataWords, so that similar data items are mapped into the same sentence.
This permits modeling a mixture of text and structured data by using only text-modeling algorithms.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our goal is to build classification models using a combination of free-text
and structured data. To do this, we represent structured data by text
sentences, DataWords, so that similar data items are mapped into the same
sentence. This permits modeling a mixture of text and structured data by using
only text-modeling algorithms. Several examples illustrate that it is possible
to improve text classification performance by first running extraction tools
(named entity recognition), then converting the output to DataWords, and adding
the DataWords to the original text -- before model building and classification.
This approach also allows us to produce explanations for inferences in terms of
both free text and structured data.
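As a hedged illustration of the idea: the sketch below shows one way structured fields (e.g., named-entity output) could be mapped to DataWord tokens and appended to the free text before ordinary text classification. The function names (to_datawords, augment), the field names (amount, entity_type), and the binning scheme are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of the DataWords idea (hypothetical names, not the
# paper's reference code). Structured fields become synthetic tokens
# so that similar data items share the same token; the tokens are then
# appended to the free text for a plain text-classification pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def to_datawords(record):
    """Map one structured record to DataWord tokens.

    Numeric values are bucketed so nearby amounts collapse to the
    same token (binning scheme is an assumption); categorical values
    such as NER labels are used verbatim.
    """
    words = []
    if "amount" in record:
        bucket = int(record["amount"] // 100)   # coarse $100 bins
        words.append(f"dw_amount_bin_{bucket}")
    if "entity_type" in record:                 # e.g. an NER label
        words.append(f"dw_ent_{record['entity_type'].lower()}")
    return words

def augment(text, records):
    """Append DataWords to the original free text."""
    extra = [w for r in records for w in to_datawords(r)]
    return text + " " + " ".join(extra)

# Toy usage: two documents whose structured fields differ.
docs = [
    augment("payment received from vendor",
            [{"amount": 250.0, "entity_type": "ORG"}]),
    augment("refund issued to customer",
            [{"amount": 40.0, "entity_type": "PERSON"}]),
]
labels = [1, 0]

# A text-only model now sees both the text and the structured data.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs, labels)
```
Because similar amounts fall into the same bucket, similar data items share a DataWord, which is what lets a purely text-based model exploit the structured fields.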
Related papers
- Unifying Structured Data as Graph for Data-to-Text Pre-Training
Data-to-text (D2T) generation aims to transform structured data into natural language text.
Data-to-text pre-training has proved to be powerful in enhancing D2T generation.
We propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer.
arXiv Detail & Related papers (2024-01-02T12:23:49Z)
- Stylized Data-to-Text Generation: A Case Study in the E-Commerce Domain
We propose a new task, namely stylized data-to-text generation, whose aim is to generate coherent text according to a specific style.
This task is non-trivial, due to three challenges: the logic of the generated text, unstructured style reference, and biased training samples.
We propose a novel stylized data-to-text generation model, named StyleD2T, comprising three components: logic planning-enhanced data embedding, mask-based style embedding, and unbiased stylized text generation.
arXiv Detail & Related papers (2023-05-05T03:02:41Z)
- Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text
This paper presents an end-to-end machine learning pipeline, Text2Struct.
It includes a text annotation scheme, training data processing, and machine learning implementation.
The pipeline is expected to improve further by expanding the dataset and investigating other machine learning models.
arXiv Detail & Related papers (2022-12-18T09:31:36Z)
- Data-to-text Generation with Variational Sequential Planning
We consider the task of data-to-text generation, which aims to create textual output from non-linguistic input.
We propose a neural model enhanced with a planning component responsible for organizing high-level information in a coherent and meaningful way.
We infer latent plans sequentially with a structured variational model, while interleaving the steps of planning and generation.
arXiv Detail & Related papers (2022-02-28T13:17:59Z)
- Modelling the semantics of text in complex document layouts using graph transformer networks
We propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span.
We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information.
arXiv Detail & Related papers (2022-02-18T11:49:06Z)
- A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications
This paper presents two datasets comprised of artificially generated research content.
In the first case, the content is completely generated by the GPT-2 model after a short prompt extracted from original papers.
The partial or hybrid dataset is created by replacing several sentences of abstracts with sentences that are generated by the Arxiv-NLP model.
We evaluate the quality of the datasets comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE.
arXiv Detail & Related papers (2022-02-04T08:16:56Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Data-to-Text Generation with Iterative Text Editing
We present a novel approach to data-to-text generation based on iterative text editing.
We first transform data items to text using trivial templates, and then we iteratively improve the resulting text by a neural model trained for the sentence fusion task.
The output of the model is filtered by a simple heuristic and reranked with an off-the-shelf pre-trained language model.
arXiv Detail & Related papers (2020-11-03T13:32:38Z)
- DART: Open-Domain Structured Data Record to Text Generation
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs).
We propose a procedure of extracting semantic triples from tables that encode their structures by exploiting the semantic dependencies among table headers and the table title.
Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks.
arXiv Detail & Related papers (2020-07-06T16:35:30Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset in the same writing style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)