Beyond Accuracy: ROI-driven Data Analytics of Empirical Data
- URL: http://arxiv.org/abs/2009.06492v1
- Date: Mon, 14 Sep 2020 14:49:37 GMT
- Title: Beyond Accuracy: ROI-driven Data Analytics of Empirical Data
- Authors: Gouri Deshpande and Guenther Ruhe
- Abstract summary: This vision paper demonstrates that it is crucial to consider Return-on-Investment (ROI) when performing Data Analytics.
- Score: 3.5751623095926806
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This vision paper demonstrates that it is crucial to consider
Return-on-Investment (ROI) when performing Data Analytics. Decisions on how
much analytics is needed are hard to make. ROI could provide decision support
on the what, how, and how much of analytics for a given problem. Method: The
proposed conceptual framework is validated through two empirical studies that
focus on requirements dependency extraction in the Mozilla Firefox project.
The two case studies are (i) evaluation of fine-tuned BERT against Naive
Bayes and Random Forest machine learners for binary dependency classification
and (ii) Active Learning against passive learning (random sampling) for
REQUIRES dependency extraction. For both cases, the analysis investment
(cost) is estimated and the achievable benefit from Data Analytics is
predicted, to determine the break-even point of the investigation. Results:
In the first study, fine-tuned BERT outperformed Random Forest, provided that
more than 40% of the training data was available. In the second, Active
Learning achieved higher F1 accuracy within fewer iterations and a higher ROI
than the baseline (a Random-sampling-based RF classifier). In both studies,
the break-even point indicated how much analysis would likely pay off for the
invested effort. Conclusions: Decisions on the depth and breadth of Data
Analytics of empirical data should not be made solely on the basis of
accuracy measures. ROI-driven Data Analytics provides a simple yet effective
way to decide when to stop further investigation while weighing the cost and
value of the various types of analysis, and thus helps avoid over-analyzing
empirical data.
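The break-even idea in the abstract can be sketched in a few lines: accumulate analysis cost against predicted benefit per iteration and report the first point where the benefit covers the cost. This is a minimal illustration of the ROI concept, not the paper's actual estimation model; all numbers below are hypothetical.

```python
# Hypothetical ROI break-even sketch. Costs and benefits per iteration
# are illustrative placeholders, not values from the paper.

def roi(benefit: float, cost: float) -> float:
    """Classic return on investment: (benefit - cost) / cost."""
    return (benefit - cost) / cost

def break_even(costs, benefits):
    """First index where cumulative benefit covers cumulative cost,
    or None if the investment never pays off over the horizon."""
    total_cost = total_benefit = 0.0
    for i, (c, b) in enumerate(zip(costs, benefits)):
        total_cost += c
        total_benefit += b
        if total_benefit >= total_cost:
            return i
    return None

# Illustrative per-iteration labeling costs and predicted benefits.
costs = [10, 10, 10, 10, 10]
benefits = [2, 6, 12, 18, 25]
idx = break_even(costs, benefits)
print(idx)  # → 4: analysis pays off only at the fifth iteration
print(roi(sum(benefits[: idx + 1]), sum(costs[: idx + 1])))  # → 0.26
```

Iterations past the break-even point are where "how much analytics is enough" becomes a judgment about marginal benefit versus marginal cost.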
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- AROhI: An Interactive Tool for Estimating ROI of Data Analytics [0.0]
It is crucial to consider Return On Investment when performing data analytics.
This work details a comprehensive tool that provides conventional and advanced ML approaches for demonstration.
arXiv Detail & Related papers (2024-07-18T18:19:17Z)
- Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models.
We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies.
We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
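Uncertainty Sampling, as summarized above, queries the pool examples the model is least sure about. A common instantiation (an assumption here, not necessarily the one this paper benchmarks) ranks unlabeled examples by predictive entropy:

```python
# Minimal entropy-based Uncertainty Sampling sketch: given predicted
# class probabilities for an unlabeled pool, query the k most uncertain
# examples. The pool probabilities below are illustrative.
import math

def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, k):
    """Indices of the k highest-entropy (most uncertain) examples."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

pool = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.99, 0.01]]
print(select_queries(pool, 2))  # → [1, 2]
```

The near-uniform predictions (indices 1 and 2) are queried first; confident predictions like [0.99, 0.01] are left unlabeled.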
arXiv Detail & Related papers (2024-05-02T16:50:47Z)
- How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets.
To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial.
This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
arXiv Detail & Related papers (2023-04-28T19:05:16Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Reinforced Approximate Exploratory Data Analysis [7.974685452145769]
We are first to consider the impact of sampling in interactive data exploration settings as they introduce approximation errors.
We propose a Deep Reinforcement Learning (DRL) based framework which can optimize the sample selection in order to keep the analysis and insight generation flow intact.
arXiv Detail & Related papers (2022-12-12T20:20:22Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
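The ATC idea described above can be sketched simply: pick a confidence threshold on labeled source data so that the fraction of source examples above it matches source accuracy, then predict target accuracy as the fraction of unlabeled target confidences above that threshold. This is a hedged toy illustration of that recipe, not the authors' implementation; all numbers are made up.

```python
# Toy ATC-style sketch: calibrate a confidence threshold on labeled
# source data, then predict accuracy on an unlabeled target set.

def learn_threshold(source_conf, source_correct):
    """Choose t so that the fraction of source confidences above t
    approximately matches source accuracy."""
    acc = sum(source_correct) / len(source_correct)
    for t in sorted(source_conf, reverse=True):
        frac = sum(c > t for c in source_conf) / len(source_conf)
        if frac >= acc:
            return t
    return min(source_conf)

def predict_target_accuracy(target_conf, t):
    """Predicted accuracy: fraction of target confidences above t."""
    return sum(c > t for c in target_conf) / len(target_conf)

source_conf = [0.95, 0.9, 0.8, 0.6, 0.4]
source_correct = [1, 1, 1, 0, 0]        # source accuracy = 0.6
t = learn_threshold(source_conf, source_correct)
print(predict_target_accuracy([0.92, 0.85, 0.5, 0.3], t))  # → 0.5
```

The appeal of this recipe is that the target side needs no labels at all, only model confidences.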
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- How Much Data Analytics is Enough? The ROI of Machine Learning Classification and its Application to Requirements Dependency Classification [5.195942130196466]
Machine Learning can substantially improve the efficiency and effectiveness of organizations.
However, the selection and implementation of ML techniques rely almost exclusively on accuracy criteria.
We present findings for an approach that addresses this gap by enhancing the accuracy criterion with return on investment considerations.
arXiv Detail & Related papers (2021-09-28T23:27:57Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.