Occam's Razor for Big Data? On Detecting Quality in Large Unstructured Datasets
- URL: http://arxiv.org/abs/2011.08663v1
- Date: Thu, 12 Nov 2020 16:06:01 GMT
- Title: Occam's Razor for Big Data? On Detecting Quality in Large Unstructured Datasets
- Authors: Birgitta Dresp-Langley, Ole Kristian Ekseth, Jan Fesl, Seiichi Gohshi,
  Marc Kurz, Hans-Werner Sehring
- Abstract summary: A new trend towards analytic complexity represents a severe challenge for the principle of parsimony, or Occam's Razor, in science.
Computational building block approaches for data clustering can help to deal with large unstructured datasets in minimized computation time.
The review concludes with how cultural differences between East and West are likely to affect the course of big data analytics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting quality in large unstructured datasets requires capacities far
beyond the limits of human perception and communicability and, as a result,
there is an emerging trend towards increasingly complex analytic solutions in
data science to cope with this problem. This new trend towards analytic
complexity represents a severe challenge for the principle of parsimony or
Occam's Razor in science. This review article combines insight from various
domains such as physics, computational science, data engineering, and cognitive
science to review the specific properties of big data. Problems for detecting
data quality without losing the principle of parsimony are then highlighted on
the basis of specific examples. Computational building block approaches for
data clustering can help to deal with large unstructured datasets in minimized
computation time, and meaning can be extracted rapidly from large sets of
unstructured image or video data parsimoniously through relatively simple
unsupervised machine learning algorithms. Why we still massively lack
expertise for exploiting big data wisely to extract relevant information for
specific tasks, recognize patterns, generate new information, or store and
further process large amounts of sensor data is then reviewed; examples
illustrating why we need subjective views and pragmatic methods to analyze big
data contents are brought forward. The review concludes on how cultural
differences between East and West are likely to affect the course of big data
analytics, and the development of increasingly autonomous artificial
intelligence aimed at coping with the big data deluge in the near future.
 
      
Related papers
- A systematic data characteristic understanding framework towards physical-sensor big data challenges [0.9672182825841383]
 Recent advancements in sensor networks and the widespread adoption of IoT have led to the collection of physical-sensor data on an enormous scale.
To uncover big data challenges and enhance data quality, it is essential to quantitatively unveil data characteristics.
This paper proposes a systematic data characteristic framework based on a 6Vs model.
 arXiv (2025-01-22T08:49:44Z)
- Deep Learning, Machine Learning, Advancing Big Data Analytics and Management [26.911181864764117]
 Advances in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management.
This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies.
It equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics.
 arXiv (2024-12-03T05:59:34Z)
- Big data searching using words [0.0]
 We introduce some fundamental ideas related to the neighborhood structure of words in data searching.
We also introduce big data primal in big data searching and discuss the application of neighborhood structures in detecting anomalies in data searching.
 arXiv (2024-09-10T13:46:14Z)
- Enhancing High-Energy Particle Physics Collision Analysis through Graph Data Attribution Techniques [0.0]
 This paper uses a simulated particle collision dataset to integrate influence analysis inside the graph classification pipeline.
By using a Graph Neural Network for initial training, we applied a gradient-based data influence method to identify influential training samples.
By analyzing the discarded elements we can provide further insights about the event classification task.
 arXiv (2024-07-20T12:40:03Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
 There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
 arXiv (2023-10-24T14:01:53Z)
- Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
 We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
 arXiv (2023-07-10T04:30:23Z)
- Towards Generalizable Data Protection With Transferable Unlearnable Examples [50.628011208660645]
 We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
 arXiv (2023-05-18T04:17:01Z)
- Anomaly detection using data depth: multivariate case [3.046315755726937]
 Anomaly detection is a branch of data analysis and machine learning.
Data depth is a statistical function that measures belongingness of any point of the space to a data set.
This article studies data depth as an efficient anomaly detection tool, assigning abnormality labels to observations with lower depth values.
 arXiv (2022-10-06T12:14:25Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
 Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
 arXiv (2022-07-29T02:34:19Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
 Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
 arXiv (2022-05-25T17:37:08Z)
- Audacity of huge: overcoming challenges of data scarcity and data quality for machine learning in computational materials discovery [1.0036312061637764]
 Machine learning (ML)-accelerated discovery requires large amounts of high-fidelity data to reveal predictive structure-property relationships.
For many properties of interest in materials discovery, the challenging nature and high cost of data generation have resulted in a data landscape that is scarcely populated and of dubious quality.
In the absence of manual curation, increasingly sophisticated natural language processing and automated image analysis are making it possible to learn structure-property relationships from the literature.
 arXiv (2021-11-02T21:43:58Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
 We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
 arXiv (2021-04-17T21:34:10Z)
- Towards an Integrated Platform for Big Data Analysis [4.5257812998381315]
 This paper presents the vision of an integrated platform for big data analysis that combines all these aspects.
The main benefits of this approach are enhanced scalability of the whole platform, better parameterization of algorithms, and improved usability during the end-to-end data analysis process.
 arXiv (2020-04-27T03:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     