Random functions as data compressors for machine learning of molecular processes
- URL: http://arxiv.org/abs/2509.17937v1
- Date: Sun, 07 Sep 2025 11:45:27 GMT
- Title: Random functions as data compressors for machine learning of molecular processes
- Authors: Jayashrita Debnath, Gerhard Hummer
- Abstract summary: We show that random nonlinear projections can be used to compress large feature spaces and make computations faster without substantial loss of information. For our test cases NTL9 and the double-norleucine variant of the villin headpiece, we find that random compression retains the core static and dynamic information.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine learning (ML) is rapidly transforming the way molecular dynamics simulations are performed and analyzed, from materials modeling to studies of protein folding and function. ML algorithms are often employed to learn low-dimensional representations of conformational landscapes and to cluster trajectories into relevant metastable states. Most of these algorithms require selecting a small number of features that describe the problem of interest. Although deep neural networks can tackle large numbers of input features, the training costs increase with input size, which makes the selection of a subset of features mandatory for most problems of practical interest. Here, we show that random nonlinear projections can be used to compress large feature spaces and make computations faster without substantial loss of information. We describe an efficient way to produce random projections and then exemplify the general procedure for protein folding. For our test cases NTL9 and the double-norleucine variant of the villin headpiece, we find that random compression retains the core static and dynamic information of the original high-dimensional feature space and makes trajectory analysis more robust.
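The core idea of the abstract can be sketched in a few lines: draw a fixed random matrix once, and map each frame's high-dimensional feature vector through a random nonlinear projection to a much smaller descriptor. This is a minimal illustrative sketch, not the authors' exact construction; the dimensions, Gaussian weights, and tanh nonlinearity are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 10_000   # size of the original high-dimensional feature space
n_compressed = 128    # target compressed dimension (illustrative choice)

# Fixed random projection: drawn once, then reused for every trajectory frame.
W = rng.normal(scale=1.0 / np.sqrt(n_features), size=(n_compressed, n_features))
b = rng.uniform(-np.pi, np.pi, size=n_compressed)

def compress(x: np.ndarray) -> np.ndarray:
    """Map a feature vector to its random nonlinear projection."""
    return np.tanh(x @ W.T + b)

# Example: one simulation frame's features -> 128-dimensional descriptor.
frame_features = rng.normal(size=n_features)
z = compress(frame_features)
print(z.shape)  # (128,)
```

Because the projection is fixed and data-independent, the same map can be applied consistently across frames and trajectories, and downstream analysis (dimensionality reduction, clustering into metastable states) operates on the compressed descriptors.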
Related papers
- Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions [12.410304632874531]
We prove that introducing a small random perturbation to the data distribution--via a random shift in the first moment--renders any Gaussian single-index model as easy to learn as a linear function. We extend this result to a class of multi-index models, namely sparse Boolean functions, also known as Juntas.
arXiv Detail & Related papers (2025-02-10T13:19:30Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Gram-Schmidt Methods for Unsupervised Feature Extraction and Selection [7.373617024876725]
We propose a Gram-Schmidt process over function spaces to detect and map out nonlinear dependencies. We provide experimental results for synthetic and real-world benchmark datasets. Surprisingly, our linear feature extraction algorithms are comparable to and often outperform several important nonlinear feature extraction methods.
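The linear special case of a Gram-Schmidt feature-selection procedure can be sketched as follows: greedily pick the feature column with the largest residual norm, then orthogonalize the remaining columns against it. This is a generic illustration of the classical idea, not the paper's function-space method.

```python
import numpy as np

def gram_schmidt_select(X: np.ndarray, k: int) -> list[int]:
    """Greedy Gram-Schmidt feature selection (linear special case).

    At each step, select the column with the largest residual norm after
    orthogonalizing against the columns already selected.
    """
    R = X - X.mean(axis=0)          # center features
    selected: list[int] = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        norms[selected] = -1.0      # never re-pick a selected feature
        j = int(np.argmax(norms))
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(q, q @ R)  # remove the chosen direction
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:, 3] = 5 * X[:, 0]               # feature 3 is a redundant copy of feature 0
print(gram_schmidt_select(X, 3))
```

In this toy example, the redundant feature pair is detected automatically: once column 3 is selected, column 0 has zero residual and is never picked.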
arXiv Detail & Related papers (2023-11-15T21:29:57Z) - Representation Learning with Multi-Step Inverse Kinematics: An Efficient
and Optimal Approach to Rich-Observation RL [106.82295532402335]
Existing reinforcement learning algorithms suffer from computational intractability, strong statistical assumptions, and suboptimal sample complexity.
We provide the first computationally efficient algorithm that attains rate-optimal sample complexity with respect to the desired accuracy level.
Our algorithm, MusIK, combines systematic exploration with representation learning based on multi-step inverse kinematics.
arXiv Detail & Related papers (2023-04-12T14:51:47Z) - Score-based Diffusion Models in Function Space [137.70916238028306]
Diffusion models have recently emerged as a powerful framework for generative modeling. This work introduces a mathematically rigorous framework called Denoising Diffusion Operators (DDOs) for training diffusion models in function space. We show that the corresponding discretized algorithm generates accurate samples at a fixed cost independent of the data resolution.
arXiv Detail & Related papers (2023-02-14T23:50:53Z) - Adaptive Machine Learning for Time-Varying Systems: Low Dimensional
Latent Space Tuning [91.3755431537592]
We present a recently developed method of adaptive machine learning for time-varying systems.
Our approach is to map very high (N>100k) dimensional inputs into the low dimensional (N≈2) latent space at the output of the encoder section of an encoder-decoder CNN.
This method allows us to learn correlations within high-dimensional data and to track their evolution in real time based on feedback, without interrupting operation.
arXiv Detail & Related papers (2021-07-13T16:05:28Z) - Scalable Gaussian Processes for Data-Driven Design using Big Data with
Categorical Factors [14.337297795182181]
Gaussian processes (GP) have difficulties in accommodating big datasets, categorical inputs, and multiple responses.
We propose a GP model that utilizes latent variables and functions obtained through variational inference to address the aforementioned challenges simultaneously.
Our approach is demonstrated for machine learning of ternary oxide materials and topology optimization of a multiscale compliant mechanism.
arXiv Detail & Related papers (2021-06-26T02:17:23Z) - Time Varying Particle Data Feature Extraction and Tracking with Neural
Networks [20.825102707056647]
We take a deep learning approach to create feature representations for scientific particle data to assist feature extraction and tracking.
We employ a deep learning model, which produces latent vectors to represent the relation between spatial locations and physical attributes in a local neighborhood.
To achieve fast feature tracking, the mean-shift tracking algorithm is applied in the feature space.
arXiv Detail & Related papers (2021-05-27T15:38:14Z) - Adaptive Latent Space Tuning for Non-Stationary Distributions [62.997667081978825]
We present a method for adaptive tuning of the low-dimensional latent space of deep encoder-decoder style CNNs.
We demonstrate our approach for predicting the properties of a time-varying charged particle beam in a particle accelerator.
arXiv Detail & Related papers (2021-05-08T03:50:45Z) - Quantum Algorithms for Data Representation and Analysis [68.754953879193]
We provide quantum procedures that speed-up the solution of eigenproblems for data representation in machine learning.
The power and practical use of these subroutines is shown through new quantum algorithms, sublinear in the input matrix's size, for principal component analysis, correspondence analysis, and latent semantic analysis.
Results show that the run-time parameters that do not depend on the input's size are reasonable and that the error on the computed model is small, allowing for competitive classification performances.
arXiv Detail & Related papers (2021-04-19T00:41:43Z) - Random Sampling High Dimensional Model Representation Gaussian Process
Regression (RS-HDMR-GPR) for representing multidimensional functions with
machine-learned lower-dimensional terms allowing insight with a general
method [0.0]
We present a Python implementation of RS-HDMR-GPR (Random Sampling High Dimensional Model Representation Gaussian Process Regression).
Code allows for imputation of missing values of the variables and for a significant pruning of the useful number of HDMR terms.
The capabilities of this regression tool are demonstrated on test cases involving synthetic analytic functions, the potential energy surface of the water molecule, kinetic energy densities of materials, and financial market data.
arXiv Detail & Related papers (2020-11-24T00:12:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.