Occam's Razor for Big Data? On Detecting Quality in Large Unstructured
Datasets
- URL: http://arxiv.org/abs/2011.08663v1
- Date: Thu, 12 Nov 2020 16:06:01 GMT
- Title: Occam's Razor for Big Data? On Detecting Quality in Large Unstructured
Datasets
- Authors: Birgitta Dresp-Langley, Ole Kristian Ekseth, Jan Fesl, Seiichi Gohshi,
Marc Kurz, Hans-Werner Sehring
- Abstract summary: The new trend towards analytic complexity represents a severe challenge for the principle of parsimony, or Occam's Razor, in science.
Computational building block approaches for data clustering can help to deal with large unstructured datasets in minimized computation time.
The review concludes on how cultural differences between East and West are likely to affect the course of big data analytics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting quality in large unstructured datasets requires capacities far
beyond the limits of human perception and communicability and, as a result,
there is an emerging trend towards increasingly complex analytic solutions in
data science to cope with this problem. This new trend towards analytic
complexity represents a severe challenge for the principle of parsimony or
Occam's Razor in science. This review article combines insights from various
domains such as physics, computational science, data engineering, and cognitive
science to review the specific properties of big data. Problems for detecting
data quality without losing the principle of parsimony are then highlighted on
the basis of specific examples. Computational building block approaches for
data clustering can help to deal with large unstructured datasets in minimized
computation time, and meaning can be extracted rapidly from large sets of
unstructured image or video data parsimoniously through relatively simple
unsupervised machine learning algorithms. The review then examines why we
still largely lack the expertise to exploit big data wisely, whether to
extract relevant information for specific tasks, recognize patterns, generate
new information, or store and further process large amounts of sensor data;
examples illustrating why we need subjective views and pragmatic methods to
analyze big data contents are brought forward. The review concludes on how cultural
differences between East and West are likely to affect the course of big data
analytics, and the development of increasingly autonomous artificial
intelligence aimed at coping with the big data deluge in the near future.
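As an illustration of the kind of parsimonious, unsupervised approach the abstract alludes to, the sketch below clusters toy two-dimensional feature vectors with a minimal k-means loop in plain Python. The data, parameter values, and function name are illustrative assumptions, not details taken from the paper.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the centroid with the smallest squared distance to p
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters

# Toy "feature vectors": two well-separated groups of three points each
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(data, k=2)
```

Despite its simplicity, a loop like this recovers the two underlying groups in the toy data; the point of the example is that very little algorithmic machinery is needed when the goal is a parsimonious first structuring of unlabeled data.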
Related papers
- Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques [0.0]
This paper uses a simulated particle collision dataset to integrate influence analysis inside the graph classification pipeline.
By using a Graph Neural Network for initial training, we applied a gradient-based data influence method to identify influential training samples.
By analyzing the discarded elements we can provide further insights about the event classification task.
arXiv Detail & Related papers (2024-07-20T12:40:03Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A
Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - Towards Generalizable Data Protection With Transferable Unlearnable
Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z) - A Vision for Semantically Enriched Data Science [19.604667287258724]
Key areas such as utilizing domain knowledge and data semantics have seen little automation.
We envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
arXiv Detail & Related papers (2023-03-02T16:03:12Z) - Anomaly detection using data depth: multivariate case [3.046315755726937]
Anomaly detection is a branch of data analysis and machine learning.
Data depth is a statistical function that measures belongingness of any point of the space to a data set.
This article studies data depth as an efficient anomaly detection tool, assigning abnormality labels to observations with lower depth values.
arXiv Detail & Related papers (2022-10-06T12:14:25Z) - A Survey of Learning on Small Data: Generalization, Optimization, and
Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Audacity of huge: overcoming challenges of data scarcity and data
quality for machine learning in computational materials discovery [1.0036312061637764]
Machine learning (ML)-accelerated discovery requires large amounts of high-fidelity data to reveal predictive structure-property relationships.
For many properties of interest in materials discovery, the challenging nature and high cost of data generation has resulted in a data landscape that is scarcely populated and of dubious quality.
In the absence of manual curation, increasingly sophisticated natural language processing and automated image analysis are making it possible to learn structure-property relationships from the literature.
arXiv Detail & Related papers (2021-11-02T21:43:58Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Towards an Integrated Platform for Big Data Analysis [4.5257812998381315]
This paper presents the vision of an integrated platform for big data analysis that combines all these aspects.
Main benefits of this approach are an enhanced scalability of the whole platform, a better parameterization of algorithms, and an improved usability during the end-to-end data analysis process.
arXiv Detail & Related papers (2020-04-27T03:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.