Leveraging Data Augmentation for Process Information Extraction
- URL: http://arxiv.org/abs/2404.07501v1
- Date: Thu, 11 Apr 2024 06:32:03 GMT
- Title: Leveraging Data Augmentation for Process Information Extraction
- Authors: Julian Neuberger, Leonie Doll, Benedict Engelmann, Lars Ackermann, Stefan Jablonski
- Abstract summary: We investigate the application of data augmentation for natural language text data.
Data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Business Process Modeling projects often require formal process models as a central component. The high costs associated with the creation of such formal process models motivated many different fields of research aimed at automated generation of process models from readily available data. These include process mining on event logs, and generating business process models from natural language texts. Research in the latter field is regularly faced with the problem of limited data availability, hindering both evaluation and development of new techniques, especially learning-based ones. To overcome this data scarcity issue, in this paper we investigate the application of data augmentation for natural language text data. Data augmentation methods are well established in machine learning for creating new, synthetic data without human assistance. We find that many of these methods are applicable to the task of business process information extraction, improving the accuracy of extraction. Our study shows that data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text, where rule-based systems currently remain the state of the art. Simple data augmentation techniques improved the $F_1$ score of mention extraction by $2.9$ percentage points, and the $F_1$ score of relation extraction by $4.5$ percentage points. To better understand how data augmentation alters human-annotated texts, we analyze the resulting text, visualizing and discussing the properties of augmented textual data. We make all code and experiment results publicly available.
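The abstract does not spell out which "simple" techniques were used, so the sketch below is only an illustrative guess at one common text augmentation method, synonym replacement. The synonym table, the `synonym_replace` helper, and the example sentence are all hypothetical and are not taken from the paper's PET data or its actual implementation.

```python
import random

# A minimal sketch of synonym replacement, one simple text augmentation
# technique. The synonym table and example sentence are illustrative
# assumptions, not the paper's actual code or data.
SYNONYMS = {
    "sends": ["forwards", "transmits"],
    "checks": ["verifies", "reviews"],
    "request": ["application", "inquiry"],
}

def synonym_replace(tokens, p=0.3, rng=random):
    """Replace each token that has a known synonym with probability p."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

if __name__ == "__main__":
    random.seed(42)
    sentence = "the clerk checks the request and sends it to the manager".split()
    for _ in range(3):
        print(" ".join(synonym_replace(sentence)))
```

One property that makes token-for-token replacement attractive for mention and relation extraction specifically: because each token is swapped one-to-one, the token-level entity spans and relation annotations of the original sentence carry over to the augmented sentence unchanged.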
Related papers
- Assisted Data Annotation for Business Process Information Extraction from Textual Documents [15.770020803430246]
Machine-learning-based generation of process models from natural language process descriptions offers a solution to the time-intensive and expensive process discovery phase.
We present two assistance features to support dataset creation, a recommendation system for identifying process information in the text and visualization of the current state of already identified process information as a graphical business process model.
A controlled user study with 31 participants shows that assisting dataset creators with recommendations lowers all aspects of workload by up to 51.0% and significantly improves annotation quality by up to 38.9%.
arXiv Detail & Related papers (2024-10-02T09:14:39Z) - A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models [0.8899670429041453]
We show that generative large language models (LLMs) can solve NLP tasks with very high quality without the need for extensive data.
Based on a novel prompting strategy, we show that LLMs are able to outperform state-of-the-art machine learning approaches.
arXiv Detail & Related papers (2024-07-26T06:39:35Z) - Computational Job Market Analysis with Natural Language Processing [5.117211717291377]
This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions.
We frame the problem, obtain annotated data, and introduce extraction methodologies.
Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training.
arXiv Detail & Related papers (2024-04-29T14:52:38Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - Text2Data: Low-Resource Data Generation with Textual Control [104.38011760992637]
Natural language serves as a common and straightforward control signal for humans to interact seamlessly with machines.
We propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model.
It undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z) - Beyond Rule-based Named Entity Recognition and Relation Extraction for Process Model Generation from Natural Language Text [0.0]
We present an extension to an existing pipeline to make it entirely data driven.
We demonstrate the competitiveness of our improved pipeline, which eliminates the substantial overhead associated with feature engineering and rule definition.
We propose an extension to the PET dataset that incorporates information about linguistic references and a corresponding method for resolving them.
arXiv Detail & Related papers (2023-05-06T07:06:47Z) - Data Augmentation for Neural NLP [0.0]
Data augmentation is a low-cost approach for tackling data scarcity.
This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing.
arXiv Detail & Related papers (2023-02-22T14:47:15Z) - Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z) - Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges remain unsolved: the difficulty of commonsense reasoning and data insufficiency.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z)