Text2Struct: A Machine Learning Pipeline for Mining Structured Data from
Text
- URL: http://arxiv.org/abs/2212.09044v2
- Date: Tue, 20 Dec 2022 21:49:18 GMT
- Title: Text2Struct: A Machine Learning Pipeline for Mining Structured Data from
Text
- Authors: Chaochao Zhou and Bo Yang
- Abstract summary: This paper presents an end-to-end machine learning pipeline, Text2Struct.
It includes a text annotation scheme, training data processing, and machine learning implementation.
The authors expect to improve the pipeline further by expanding the dataset and investigating other machine learning models.
- Score: 4.709764624933227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many analysis and prediction tasks require the extraction of structured data
from unstructured texts. To address this, this paper presents an end-to-end machine
learning pipeline, Text2Struct, including a text annotation scheme, training
data processing, and machine learning implementation. We formulated the mining
problem as the extraction of metrics and units associated with numerals in the
text. Text2Struct was evaluated on an annotated text dataset collected from
abstracts of medical publications regarding thrombectomy. In terms of
prediction performance, a Dice coefficient of 0.82 was achieved on the test
dataset. In a random sample, most predicted relations between numerals and
entities matched the ground-truth annotations well. These results show
that Text2Struct is viable for mining structured data from text
without special templates or patterns. We anticipate further improving the
pipeline by expanding the dataset and investigating other machine learning
models. A code demonstration can be found at:
https://github.com/zcc861007/CourseProject
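The abstract reports a Dice coefficient of 0.82 but does not say at what granularity it was computed. As a minimal sketch, assuming the set-based definition 2|A∩B| / (|A| + |B|) applied to extracted relations, the score could be computed as below. The `Relation` record, the `dice_coefficient` helper, and the sample values are hypothetical illustrations and are not taken from the paper or its repository.

```python
from typing import NamedTuple, Set


class Relation(NamedTuple):
    """Hypothetical record: a numeral linked to its metric and unit."""
    numeral: str
    metric: str
    unit: str


def dice_coefficient(predicted: Set[Relation], gold: Set[Relation]) -> float:
    """Set-based Dice: 2 * |A & B| / (|A| + |B|)."""
    if not predicted and not gold:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    overlap = len(predicted & gold)
    return 2.0 * overlap / (len(predicted) + len(gold))


# Invented sample data, for illustration only.
gold = {
    Relation("90", "recanalization rate", "%"),
    Relation("35", "procedure time", "min"),
    Relation("120", "patients", ""),
}
predicted = {
    Relation("90", "recanalization rate", "%"),
    Relation("35", "procedure time", "min"),
    Relation("120", "procedure time", "min"),  # numeral linked to the wrong metric
}

score = dice_coefficient(predicted, gold)
print(f"Dice coefficient: {score:.3f}")
```

Two of the three predicted relations match the annotations here, so the sketch penalizes the mislinked numeral once on the predicted side and once on the gold side. A token-level or span-level variant would follow the same formula over a different unit of comparison.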
Related papers
- Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z)
- Unifying Structured Data as Graph for Data-to-Text Pre-Training [69.96195162337793]
Data-to-text (D2T) generation aims to transform structured data into natural language text.
Data-to-text pre-training has proved to be powerful in enhancing D2T generation.
We propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer.
arXiv Detail & Related papers (2024-01-02T12:23:49Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Faithful Low-Resource Data-to-Text Generation through Cycle Training [14.375070014155817]
Methods to generate text from structured data have advanced significantly in recent years.
Cycle training uses two models which are inverses of each other.
We show that cycle training achieves nearly the same performance as fully supervised approaches.
arXiv Detail & Related papers (2023-05-24T06:44:42Z)
- HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information [0.6443952406204634]
We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model.
Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively.
arXiv Detail & Related papers (2022-03-17T21:49:26Z)
- DataWords: Getting Contrarian with Text, Structured Data and Explanations [0.0]
We represent structured data by text sentences, DataWords, so that similar data items are mapped into the same sentence.
This permits modeling a mixture of text and structured data by using only text-modeling algorithms.
arXiv Detail & Related papers (2021-11-09T19:52:13Z)
- Automated News Summarization Using Transformers [4.932130498861987]
We present a comprehensive comparison of several pre-trained, transformer-based models for text summarization.
For analysis and comparison, we use the BBC news dataset, which contains texts suitable for summarization together with human-generated summaries.
arXiv Detail & Related papers (2021-04-23T04:22:33Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- Learning to Summarize Passages: Mining Passage-Summary Pairs from Wikipedia Revision Histories [110.54963847339775]
We propose a method for automatically constructing a passage-to-summary dataset by mining the Wikipedia page revision histories.
In particular, the method mines the main body passages and the introduction sentences which are added to the pages simultaneously.
The constructed dataset contains more than one hundred thousand passage-summary pairs.
arXiv Detail & Related papers (2020-04-06T12:11:50Z)
- Selective Attention Encoders by Syntactic Graph Convolutional Networks for Document Summarization [21.351111598564987]
We propose a graph that connects the parse trees of the sentences in a document and use stacked graph convolutional networks (GCNs) to learn a syntactic representation of the document.
The proposed GCN-based selective attention approach outperforms the baselines and achieves state-of-the-art performance on the dataset.
arXiv Detail & Related papers (2020-03-18T01:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.