Unsupervised detection of semantic correlations in big data
- URL: http://arxiv.org/abs/2411.02126v1
- Date: Mon, 04 Nov 2024 14:37:07 GMT
- Title: Unsupervised detection of semantic correlations in big data
- Authors: Santiago Acevedo, Alex Rodriguez, Alessandro Laio
- Abstract summary: We present a method to detect semantic correlations in high-dimensional data represented as binary numbers.
We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data.
The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis.
- Abstract: In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test this approach identifying phase transitions in model magnetic systems and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.
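The abstract's central quantity is the intrinsic dimension of binary data: the minimum number of independent coordinates needed to describe the dataset. As a rough illustration only, the sketch below applies a generic TwoNN-style ratio estimator (Facco et al.) with Hamming distances to binary rows; it is an assumption-laden stand-in, not the paper's binary-ID algorithm, and ties between discrete distances (skipped here) are exactly the complication the paper addresses.

```python
import numpy as np

def two_nn_intrinsic_dimension(X):
    """Generic TwoNN ratio estimator applied to binary rows with
    Hamming distance. Illustrative sketch, not the paper's method."""
    n = len(X)
    mu = []
    for i in range(n):
        # Hamming distance from row i to every other row
        d = np.count_nonzero(X != X[i], axis=1).astype(float)
        d[i] = np.inf
        r1, r2 = np.sort(d)[:2]  # first and second nearest-neighbor distances
        if 0 < r1 < r2:          # skip ties/duplicates, common in binary data
            mu.append(r2 / r1)
    # Maximum-likelihood estimate of the intrinsic dimension
    return len(mu) / np.sum(np.log(mu))
```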
Related papers
- How compositional generalization and creativity improve as diffusion models are trained [82.08869888944324]
How many samples do generative models need to learn composition rules well enough to produce novel data?
We consider diffusion models trained on simple context-free grammars - tree-like graphical models used to represent the structure of data such as language and images.
We demonstrate that diffusion models learn compositional rules with the sample complexity required for clustering features with statistically similar context, a process similar to word2vec.
arXiv Detail & Related papers (2025-02-17T18:06:33Z) - Explaining Categorical Feature Interactions Using Graph Covariance and LLMs [18.44675735926458]
This paper focuses on the global synthetic dataset from the Counter Trafficking Data Collaborative.
It contains over 200,000 anonymized records spanning from 2002 to 2022 with numerous categorical features for each record.
We propose a fast and scalable method for analyzing and extracting significant categorical feature interactions.
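One simple way to picture a covariance-based interaction measure for categorical features: one-hot encode two columns and sum the magnitudes of their cross-covariance block. The function below is a hypothetical illustration of that idea, not the paper's graph-covariance method; all names and details are assumptions.

```python
import numpy as np

def categorical_interaction(a, b):
    """Toy interaction score between two categorical columns via the
    cross-covariance of their one-hot encodings. Illustrative only."""
    A = (np.asarray(a)[:, None] == np.unique(a)).astype(float)
    B = (np.asarray(b)[:, None] == np.unique(b)).astype(float)
    A -= A.mean(axis=0)          # center each one-hot column
    B -= B.mean(axis=0)
    C = A.T @ B / len(A)         # cross-covariance block between the columns
    return np.abs(C).sum()       # scalar interaction strength
```

Dependent columns score well above independent ones, since the cross-covariance of independent encodings vanishes up to sampling noise.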
arXiv Detail & Related papers (2025-01-24T21:41:26Z) - Decomposing neural networks as mappings of correlation functions [57.52754806616669]
We study the mapping between probability distributions implemented by a deep feed-forward network.
We identify essential statistics in the data, as well as different information representations that can be used by neural networks.
arXiv Detail & Related papers (2022-02-10T09:30:31Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - Graph Neural Network-Based Anomaly Detection in Multivariate Time Series [17.414474298706416]
We develop a new way to detect anomalies in high-dimensional time series data.
Our approach combines a structure learning approach with graph neural networks.
We show that our method detects anomalies more accurately than baseline approaches.
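The general recipe behind forecasting-based anomaly detection can be shown with a deliberately trivial model: predict each time step from the previous one and score time points by prediction error. The paper replaces this naive forecaster with a graph neural network over a learned sensor graph; the sketch below is purely illustrative.

```python
import numpy as np

def anomaly_scores(series):
    """Score each time step of a multivariate series by one-step
    prediction error, using the previous step as a naive forecast.
    A trivial stand-in for the paper's GNN forecaster."""
    pred = series[:-1]                 # naive one-step-ahead forecast
    err = np.abs(series[1:] - pred)    # per-sensor absolute error
    return err.max(axis=1)             # max deviation across sensors
```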
arXiv Detail & Related papers (2021-06-13T09:07:30Z) - Learning Optical Flow from a Few Matches [67.83633948984954]
We show that the dense correlation volume representation is redundant and accurate flow estimation can be achieved with only a fraction of elements in it.
Experiments show that our method can reduce computational cost and memory use significantly, while maintaining high accuracy.
arXiv Detail & Related papers (2021-04-05T21:44:00Z) - Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way that the transformation outcome is predictable by an auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z) - Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena [17.205106391379026]
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data.
A conventional explanation of this phenomenon is based on self-regularization properties of the algorithms used to train the model.
We show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension.
arXiv Detail & Related papers (2020-03-23T20:09:31Z) - Correlation-aware Deep Generative Model for Unsupervised Anomaly Detection [9.578395294627057]
Unsupervised anomaly detection aims to identify anomalous samples from highly complex and unstructured data.
We propose Correlation-aware unsupervised Anomaly detection via a Deep Gaussian Mixture Model (CADGMM).
Experiments on real-world datasets demonstrate the effectiveness of the proposed method.
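The density-based idea underlying mixture-model anomaly detection can be sketched in a few lines: fit a Gaussian mixture to (mostly normal) training data and flag the lowest-likelihood test points. This is a minimal stand-in using scikit-learn, not CADGMM's correlation-aware deep architecture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_anomaly_scores(X_train, X_test, n_components=4, seed=0):
    """Fit a GMM on training data and return anomaly scores for test
    points (higher = more anomalous). Illustrative sketch only."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(X_train)
    # score_samples returns per-point log-likelihood; negate so that
    # low-density points get high anomaly scores
    return -gmm.score_samples(X_test)
```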
arXiv Detail & Related papers (2020-02-18T03:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.