The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets
- URL: http://arxiv.org/abs/2306.14975v3
- Date: Fri, 5 Apr 2024 10:45:19 GMT
- Title: The Underlying Scaling Laws and Universal Statistical Structure of Complex Datasets
- Authors: Noam Levi, Yaron Oz
- Abstract summary: We study universal traits that emerge both in real-world complex datasets and in artificially generated ones.
Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study universal traits that emerge both in real-world complex datasets and in artificially generated ones. Our approach is to analogize data to a physical system and employ tools from statistical physics and Random Matrix Theory (RMT) to reveal their underlying structure. We focus on the feature-feature covariance matrix, analyzing both its local and global eigenvalue statistics. Our main observations are: (i) the power-law scaling exhibited by the bulk of its eigenvalues is vastly different for uncorrelated normally distributed data than for real-world data; (ii) this scaling behavior can be completely modeled by generating Gaussian data with long-range correlations; (iii) both generated and real-world datasets lie in the same universality class from the RMT perspective, as chaotic rather than integrable systems; (iv) the expected RMT statistical behavior already manifests for empirical covariance matrices at dataset sizes significantly smaller than those conventionally used for real-world training, and can be related to the number of samples required to approximate the population power-law scaling behavior; (v) the Shannon entropy is correlated with the local RMT structure and the eigenvalue scaling, is substantially smaller in strongly correlated datasets than in uncorrelated ones, and requires fewer samples to reach the distribution entropy. These findings show that, with sufficient sample size, the Gram matrix of natural image datasets can be well approximated by a Wishart random matrix with a simple covariance structure, opening the door to rigorous studies of neural network dynamics and generalization that rely on the data Gram matrix.
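As a rough illustration of the pipeline the abstract describes, the sketch below (not the authors' code) generates Gaussian data whose population covariance has a power-law spectrum, i.e. long-range correlations, and then measures the three diagnostics the paper studies: the bulk power-law scaling of the empirical covariance eigenvalues, the local level-spacing ratio statistics that separate chaotic (GOE-like) from integrable (Poisson-like) spectra, and the Shannon entropy of the normalized spectrum. The exponent `alpha`, the sample and feature counts, and the bulk fitting window are illustrative assumptions.

```python
# Minimal sketch of the paper's analysis pipeline on synthetic data.
# alpha, n_samples, n_features, and the bulk window are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, alpha = 4000, 512, 1.0

# Population covariance with eigenvalues lambda_k ~ k^(-alpha). For Gaussian
# data the eigenbasis does not affect the spectrum, so a diagonal covariance
# suffices.
pop_spectrum = np.arange(1, n_features + 1, dtype=float) ** (-alpha)
X = rng.standard_normal((n_samples, n_features)) * np.sqrt(pop_spectrum)

# Empirical feature-feature covariance matrix and its eigenvalues.
cov = X.T @ X / n_samples
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending

# (i)-(ii) Bulk power-law scaling: fit log(lambda_k) against log(k).
k = np.arange(1, n_features + 1)
bulk = slice(10, n_features // 2)
slope, _ = np.polyfit(np.log(k[bulk]), np.log(evals[bulk]), 1)
print(f"fitted bulk exponent: {slope:.2f} (population exponent: {-alpha})")

# (iii) Local RMT statistics via consecutive level-spacing ratios
# r = min(s_i, s_{i+1}) / max(s_i, s_{i+1}), which needs no unfolding.
s = np.diff(np.sort(evals))
r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
print(f"mean spacing ratio: {r.mean():.3f} "
      "(GOE/chaotic ~ 0.536, Poisson/integrable ~ 0.386)")

# (v) Shannon entropy of the normalized eigenvalue distribution.
p = evals / evals.sum()
entropy = -np.sum(p * np.log(p))
print(f"spectral entropy: {entropy:.2f} (maximum ln d = {np.log(n_features):.2f})")
```

Swapping the synthetic X for flattened natural images (e.g., CIFAR-10 pixels) and comparing the resulting spectra against this correlated-Gaussian baseline mirrors the paper's core experiment; the spacing-ratio and entropy diagnostics apply unchanged.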
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- Wrapped Gaussian on the manifold of Symmetric Positive Definite Matrices [6.7523635840772505]
Circular and non-flat data distributions are prevalent across diverse domains of data science.
A principled approach to accounting for the underlying geometry of such data is pivotal.
This work lays the groundwork for extending classical machine learning and statistical methods to more complex and structured data.
arXiv Detail & Related papers (2025-02-03T16:46:46Z)
- Learning Massive-scale Partial Correlation Networks in Clinical Multi-omics Studies with HP-ACCORD [10.459304300065186]
We introduce a novel pseudolikelihood-based graphical model framework.
It maintains estimation and selection consistency in various metrics under high-dimensional assumptions.
A high-performance computing implementation of our framework was tested on simulated data with up to one million variables.
arXiv Detail & Related papers (2024-12-16T08:38:02Z)
- Learning with Shared Representations: Statistical Rates and Efficient Algorithms [13.643155483461028]
Collaborative learning through latent shared representations enables heterogeneous clients to train personalized models with enhanced performance while reducing sample size.
Despite its empirical success and extensive research, the theoretical understanding of statistical error rates remains incomplete, even for shared representations constrained to low-dimensional linear subspaces.
arXiv Detail & Related papers (2024-09-07T21:53:01Z)
- Learning Divergence Fields for Shift-Robust Graph Representations [73.11818515795761]
In this work, we propose a geometric diffusion model with learnable divergence fields for the challenging problem of learning with interdependent data.
We derive a new learning objective through causal inference, which can guide the model to learn generalizable patterns of interdependence that are insensitive across domains.
arXiv Detail & Related papers (2024-06-07T14:29:21Z)
- DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
- Optimal regularizations for data generation with probabilistic graphical models [0.0]
Empirically, well-chosen regularization schemes dramatically improve the quality of the inferred models.
We consider the particular case of L2 and L1 regularizations in the Maximum A Posteriori (MAP) inference of generative pairwise graphical models.
arXiv Detail & Related papers (2021-12-02T14:45:16Z)
- Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models [16.436293069942312]
We are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion.
We propose a general framework that combines disparate data types through the exponential family of distributions.
The proposed algorithm is presented in detail for commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features.
arXiv Detail & Related papers (2021-08-27T18:10:31Z)
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but perform poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
- Entropy Minimizing Matrix Factorization [102.26446204624885]
Nonnegative Matrix Factorization (NMF) is a widely-used data analysis technique, and has yielded impressive results in many real-world tasks.
In this study, an Entropy Minimizing Matrix Factorization framework (EMMF) is developed to make the factorization robust to outliers.
Considering that outliers are usually far fewer than the normal samples, a new entropy loss function is established for matrix factorization.
arXiv Detail & Related papers (2021-03-24T21:08:43Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.