Occam's Razor for Big Data? On Detecting Quality in Large Unstructured
Datasets
- URL: http://arxiv.org/abs/2011.08663v1
- Date: Thu, 12 Nov 2020 16:06:01 GMT
- Title: Occam's Razor for Big Data? On Detecting Quality in Large Unstructured
Datasets
- Authors: Birgitta Dresp-Langley, Ole Kristian Ekseth, Jan Fesl, Seiichi Gohshi,
Marc Kurz, Hans-Werner Sehring
- Abstract summary: The new trend towards analytic complexity represents a severe challenge for the principle of parsimony, or Occam's Razor, in science.
Computational building block approaches for data clustering can help to deal with large unstructured datasets in minimized computation time.
The review concludes on how cultural differences between East and West are likely to affect the course of big data analytics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting quality in large unstructured datasets requires capacities far
beyond the limits of human perception and communicability and, as a result,
there is an emerging trend towards increasingly complex analytic solutions in
data science to cope with this problem. This new trend towards analytic
complexity represents a severe challenge for the principle of parsimony or
Occam's Razor in science. This review article combines insights from various
domains such as physics, computational science, data engineering, and cognitive
science to review the specific properties of big data. Problems for detecting
data quality without losing the principle of parsimony are then highlighted on
the basis of specific examples. Computational building block approaches for
data clustering can help to deal with large unstructured datasets in minimized
computation time, and meaning can be extracted rapidly from large sets of
unstructured image or video data parsimoniously through relatively simple
unsupervised machine learning algorithms. The review then examines why we
still largely lack the expertise to exploit big data wisely, whether to
extract relevant information for specific tasks, recognize patterns, generate
new information, or store and further process large amounts of sensor data;
examples illustrating why we need subjective views and pragmatic methods to
analyze big data contents are brought forward. The review concludes on how cultural
differences between East and West are likely to affect the course of big data
analytics, and the development of increasingly autonomous artificial
intelligence aimed at coping with the big data deluge in the near future.
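As an illustration of the kind of parsimonious, unsupervised approach the abstract alludes to, the sketch below clusters toy two-dimensional feature vectors with a minimal k-means loop in plain Python. The data, parameter values, and function name are illustrative assumptions, not details taken from the paper.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = list(rng.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Index of the centroid with the smallest squared distance to p
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters

# Toy "feature vectors": two well-separated groups of three points each
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(data, k=2)
```

Despite its simplicity, a loop like this recovers the two underlying groups in the toy data; the point of the example is that very little algorithmic machinery is needed when the goal is a parsimonious first structuring of unlabeled data.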
Related papers
- Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques [0.0]
This paper uses a simulated particle collision dataset to integrate influence analysis inside the graph classification pipeline.
By using a Graph Neural Network for initial training, we applied a gradient-based data influence method to identify influential training samples.
By analyzing the discarded elements we can provide further insights about the event classification task.
arXiv Detail & Related papers (2024-07-20T12:40:03Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A
Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - Towards Generalizable Data Protection With Transferable Unlearnable
Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z) - A Vision for Semantically Enriched Data Science [19.604667287258724]
Key areas such as utilizing domain knowledge and data semantics have seen little automation.
We envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
arXiv Detail & Related papers (2023-03-02T16:03:12Z) - Anomaly detection using data depth: multivariate case [3.046315755726937]
Anomaly detection is a branch of data analysis and machine learning.
Data depth is a statistical function that measures belongingness of any point of the space to a data set.
This article studies data depth as an efficient anomaly detection tool, assigning abnormality labels to observations with lower depth values.
arXiv Detail & Related papers (2022-10-06T12:14:25Z) - A Survey of Learning on Small Data: Generalization, Optimization, and
Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z) - Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z) - Audacity of huge: overcoming challenges of data scarcity and data
quality for machine learning in computational materials discovery [1.0036312061637764]
Machine learning (ML)-accelerated discovery requires large amounts of high-fidelity data to reveal predictive structure-property relationships.
For many properties of interest in materials discovery, the challenging nature and high cost of data generation has resulted in a data landscape that is scarcely populated and of dubious quality.
In the absence of manual curation, increasingly sophisticated natural language processing and automated image analysis are making it possible to learn structure-property relationships from the literature.
arXiv Detail & Related papers (2021-11-02T21:43:58Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Towards an Integrated Platform for Big Data Analysis [4.5257812998381315]
This paper presents the vision of an integrated platform for big data analysis that combines all these aspects.
Main benefits of this approach are an enhanced scalability of the whole platform, a better parameterization of algorithms, and an improved usability during the end-to-end data analysis process.
arXiv Detail & Related papers (2020-04-27T03:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.