Unsupervised detection of semantic correlations in big data
- URL: http://arxiv.org/abs/2411.02126v1
- Date: Mon, 04 Nov 2024 14:37:07 GMT
- Title: Unsupervised detection of semantic correlations in big data
- Authors: Santiago Acevedo, Alex Rodriguez, Alessandro Laio
- Abstract summary: We present a method to detect semantic correlations in high-dimensional data represented as binary numbers.
We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data.
The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis.
- Abstract: In real-world data, information is stored in extremely large feature vectors. These variables are typically correlated due to complex interactions involving many features simultaneously. Such correlations qualitatively correspond to semantic roles and are naturally recognized by both the human brain and artificial neural networks. This recognition enables, for instance, the prediction of missing parts of an image or text based on their context. We present a method to detect these correlations in high-dimensional data represented as binary numbers. We estimate the binary intrinsic dimension of a dataset, which quantifies the minimum number of independent coordinates needed to describe the data, and is therefore a proxy of semantic complexity. The proposed algorithm is largely insensitive to the so-called curse of dimensionality, and can therefore be used in big data analysis. We test this approach identifying phase transitions in model magnetic systems and we then apply it to the detection of semantic correlations of images and text inside deep neural networks.
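The abstract's central quantity is the intrinsic dimension of binary data: the minimum number of independent coordinates needed to describe the dataset. As a rough illustration only, the sketch below applies a generic TwoNN-style ratio estimator (Facco et al.) with Hamming distances to binary rows; it is an assumption-laden stand-in, not the paper's binary-ID algorithm, and ties between discrete distances (skipped here) are exactly the complication the paper addresses.

```python
import numpy as np

def two_nn_intrinsic_dimension(X):
    """Generic TwoNN ratio estimator applied to binary rows with
    Hamming distance. Illustrative sketch, not the paper's method."""
    n = len(X)
    mu = []
    for i in range(n):
        # Hamming distance from row i to every other row
        d = np.count_nonzero(X != X[i], axis=1).astype(float)
        d[i] = np.inf
        r1, r2 = np.sort(d)[:2]  # first and second nearest-neighbor distances
        if 0 < r1 < r2:          # skip ties/duplicates, common in binary data
            mu.append(r2 / r1)
    # Maximum-likelihood estimate of the intrinsic dimension
    return len(mu) / np.sum(np.log(mu))
```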
Related papers
- How compositional generalization and creativity improve as diffusion models are trained [82.08869888944324]
How many samples do generative models need to learn composition rules well enough to produce novel data?
We consider diffusion models trained on simple context-free grammars - tree-like graphical models used to represent the structure of data such as language and images.
We demonstrate that diffusion models learn compositional rules with the sample complexity required for clustering features with statistically similar context, a process similar to word2vec.
arXiv Detail & Related papers (2025-02-17T18:06:33Z) - Explaining Categorical Feature Interactions Using Graph Covariance and LLMs [18.44675735926458]
This paper focuses on the global synthetic dataset from the Counter Trafficking Data Collaborative.
It contains over 200,000 anonymized records spanning from 2002 to 2022 with numerous categorical features for each record.
We propose a fast and scalable method for analyzing and extracting significant categorical feature interactions.
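One simple way to picture a covariance-based interaction measure for categorical features: one-hot encode two columns and sum the magnitudes of their cross-covariance block. The function below is a hypothetical illustration of that idea, not the paper's graph-covariance method; all names and details are assumptions.

```python
import numpy as np

def categorical_interaction(a, b):
    """Toy interaction score between two categorical columns via the
    cross-covariance of their one-hot encodings. Illustrative only."""
    A = (np.asarray(a)[:, None] == np.unique(a)).astype(float)
    B = (np.asarray(b)[:, None] == np.unique(b)).astype(float)
    A -= A.mean(axis=0)          # center each one-hot column
    B -= B.mean(axis=0)
    C = A.T @ B / len(A)         # cross-covariance block between the columns
    return np.abs(C).sum()       # scalar interaction strength
```

Dependent columns score well above independent ones, since the cross-covariance of independent encodings vanishes up to sampling noise.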
arXiv Detail & Related papers (2025-01-24T21:41:26Z) - Decomposing neural networks as mappings of correlation functions [57.52754806616669]
We study the mapping between probability distributions implemented by a deep feed-forward network.
We identify essential statistics in the data, as well as different information representations that can be used by neural networks.
arXiv Detail & Related papers (2022-02-10T09:30:31Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - Graph Neural Network-Based Anomaly Detection in Multivariate Time Series [17.414474298706416]
We develop a new way to detect anomalies in high-dimensional time series data.
Our approach combines a structure learning approach with graph neural networks.
We show that our method detects anomalies more accurately than baseline approaches.
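The general recipe behind forecasting-based anomaly detection can be shown with a deliberately trivial model: predict each time step from the previous one and score time points by prediction error. The paper replaces this naive forecaster with a graph neural network over a learned sensor graph; the sketch below is purely illustrative.

```python
import numpy as np

def anomaly_scores(series):
    """Score each time step of a multivariate series by one-step
    prediction error, using the previous step as a naive forecast.
    A trivial stand-in for the paper's GNN forecaster."""
    pred = series[:-1]                 # naive one-step-ahead forecast
    err = np.abs(series[1:] - pred)    # per-sensor absolute error
    return err.max(axis=1)             # max deviation across sensors
```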
arXiv Detail & Related papers (2021-06-13T09:07:30Z) - Learning Optical Flow from a Few Matches [67.83633948984954]
We show that the dense correlation volume representation is redundant and accurate flow estimation can be achieved with only a fraction of elements in it.
Experiments show that our method can reduce computational cost and memory use significantly, while maintaining high accuracy.
arXiv Detail & Related papers (2021-04-05T21:44:00Z) - Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder in such a way that the transformation outcome is predictable by an auxiliary network.
arXiv Detail & Related papers (2020-10-10T14:04:44Z) - Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena [17.205106391379026]
In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on unseen data.
A conventional explanation of this phenomenon is based on self-regularization properties of the algorithms used to train the model.
We show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension.
arXiv Detail & Related papers (2020-03-23T20:09:31Z) - Correlation-aware Deep Generative Model for Unsupervised Anomaly Detection [9.578395294627057]
Unsupervised anomaly detection aims to identify anomalous samples from highly complex and unstructured data.
We propose Correlation-aware unsupervised Anomaly detection via a Deep Gaussian Mixture Model (CADGMM).
Experiments on real-world datasets demonstrate the effectiveness of the proposed method.
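The density-based idea underlying mixture-model anomaly detection can be sketched in a few lines: fit a Gaussian mixture to (mostly normal) training data and flag the lowest-likelihood test points. This is a minimal stand-in using scikit-learn, not CADGMM's correlation-aware deep architecture.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_anomaly_scores(X_train, X_test, n_components=4, seed=0):
    """Fit a GMM on training data and return anomaly scores for test
    points (higher = more anomalous). Illustrative sketch only."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(X_train)
    # score_samples returns per-point log-likelihood; negate so that
    # low-density points get high anomaly scores
    return -gmm.score_samples(X_test)
```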
arXiv Detail & Related papers (2020-02-18T03:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.