Data-to-Value: An Evaluation-First Methodology for Natural Language
Projects
- URL: http://arxiv.org/abs/2201.07725v1
- Date: Wed, 19 Jan 2022 17:04:52 GMT
- Title: Data-to-Value: An Evaluation-First Methodology for Natural Language
Projects
- Authors: Jochen L. Leidner
- Abstract summary: "Data to Value" (D2V) is a new methodology for big data text analytics projects.
It is guided by a detailed catalog of questions in order to avoid a disconnect between the big data text analytics project team and the topic.
- Score: 3.9378507882929554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Big data, i.e. the collecting, storing and processing of data at scale, has
recently become possible due to the arrival of clusters of commodity computers
powered by application-level distributed parallel operating systems like
HDFS/Hadoop/Spark, and such infrastructures have revolutionized data mining at
scale. For data mining projects to succeed more consistently, several methodologies
were developed (e.g. CRISP-DM, SEMMA, KDD), but these do not account for (1)
very large scales of processing, (2) dealing with textual (unstructured) data
(i.e. Natural Language Processing (NLP, "text analytics")), and (3)
non-technical considerations (e.g. legal, ethical, and project managerial aspects).
To address these shortcomings, a new methodology, called "Data to Value"
(D2V), is introduced. It is guided by a detailed catalog of questions in
order to avoid a disconnect between the big data text analytics project team and the
topic when facing the rather abstract box-and-arrow diagrams commonly associated
with methodologies.
Related papers
- Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z)
- Leveraging Data Augmentation for Process Information Extraction [0.0]
We investigate the application of data augmentation for natural language text data.
Data augmentation is an important component in enabling machine learning methods for the task of business process model generation from natural language text.
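Text data augmentation can take many forms; the sketch below shows one generic technique (random word-pair swapping) as an illustration of the idea, not the specific method used in that paper. The example sentence and function name are illustrative assumptions.

```python
# Generic text data augmentation sketch: random word-pair swapping.
# This illustrates augmentation for NLP in general, not the paper's method.
import random

def swap_augment(sentence: str, n_swaps: int = 1, seed: int = 0) -> str:
    """Return a copy of `sentence` with `n_swaps` random word-pair swaps."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

print(swap_augment("the clerk sends the invoice to the customer"))
```

Swapping preserves the vocabulary of the sentence while perturbing word order, which can make sequence models less sensitive to exact phrasing.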
arXiv Detail & Related papers (2024-04-11T06:32:03Z)
- DataAgent: Evaluating Large Language Models' Ability to Answer Zero-Shot, Natural Language Queries [0.0]
We evaluate OpenAI's GPT-3.5 as a "Language Data Scientist" (LDS).
The model was tested on a diverse set of benchmark datasets to evaluate its performance across multiple standards.
arXiv Detail & Related papers (2024-03-29T22:59:34Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improves the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Utilizing Domain Knowledge: Robust Machine Learning for Building Energy Prediction with Small, Inconsistent Datasets [1.1081836812143175]
The demand for a huge amount of data for machine learning (ML) applications is currently a bottleneck.
We propose a method to combine prior knowledge with data-driven methods to significantly reduce their data dependency.
CBML as the knowledge-encoded data-driven method is examined in the context of energy-efficient building engineering.
arXiv Detail & Related papers (2023-01-23T08:56:11Z)
- Investigation of Topic Modelling Methods for Understanding the Reports of the Mining Projects in Queensland [2.610470075814367]
In the mining industry, many reports are generated in the project management process.
Document clustering is a powerful approach to cope with this problem.
Three methods, Latent Dirichlet Allocation (LDA), Nonnegative Matrix Factorization (NMF), and Nonnegative Tensor Factorization (NTF), are compared.
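Two of the methods compared in that paper, LDA and NMF, can be sketched with scikit-learn; the toy corpus and parameters below are illustrative assumptions (scikit-learn has no tensor factorization, so NTF is omitted).

```python
# Minimal sketch of two topic modelling methods named above: LDA and NMF.
# The toy corpus and hyperparameters are illustrative, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

docs = [
    "drilling report mine site safety inspection",
    "mineral exploration geology survey drilling",
    "project budget schedule management report",
    "safety incident report mine management",
]

# LDA is fit on raw term counts; NMF is commonly fit on TF-IDF weights.
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
nmf = NMF(n_components=2, random_state=0).fit(tfidf)

# Each row of components_ is one topic: a weight for every vocabulary term.
print(lda.components_.shape)  # (2, vocabulary size)
print(nmf.components_.shape)
```

Both models factor a document-term matrix into document-topic and topic-term parts; they differ in that LDA fits a probabilistic generative model while NMF solves a constrained matrix factorization.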
arXiv Detail & Related papers (2021-11-05T15:52:03Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- Data-to-text Generation with Macro Planning [61.265321323312286]
We propose a neural model with a macro planning stage followed by a generation stage reminiscent of traditional methods.
Our approach outperforms competitive baselines in terms of automatic and human evaluation.
arXiv Detail & Related papers (2021-02-04T16:32:57Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and our results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale Machine Learning aims to learn patterns from big data with comparable performance efficiently.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.