Automatic feature selection and weighting in molecular systems using Differentiable Information Imbalance
- URL: http://arxiv.org/abs/2411.00851v2
- Date: Mon, 30 Dec 2024 15:38:52 GMT
- Title: Automatic feature selection and weighting in molecular systems using Differentiable Information Imbalance
- Authors: Romina Wild, Felix Wodaczek, Vittorio Del Tatto, Bingqing Cheng, Alessandro Laio
- Abstract summary: Differentiable Information Imbalance (DII) is an automated method to rank information content between sets of features.
Using distances in a ground truth feature space, DII identifies a low-dimensional subset of features that best preserves these relationships.
DII can produce sparse solutions and determine the optimal size of the reduced feature space.
- Abstract: Feature selection is essential in the analysis of molecular systems and many other fields, but several uncertainties remain: What is the optimal number of features for a simplified, interpretable model that retains essential information? How should features with different units be aligned, and how should their relative importance be weighted? Here, we introduce the Differentiable Information Imbalance (DII), an automated method to rank information content between sets of features. Using distances in a ground truth feature space, DII identifies a low-dimensional subset of features that best preserves these relationships. Each feature is scaled by a weight, which is optimized by minimizing the DII through gradient descent. This allows simultaneously performing unit alignment and relative importance scaling, while preserving interpretability. DII can also produce sparse solutions and determine the optimal size of the reduced feature space. We demonstrate the usefulness of this approach on two benchmark molecular problems: (1) identifying collective variables that describe conformations of a biomolecule, and (2) selecting features for training a machine-learning force field. These results show the potential of DII in addressing feature selection challenges and optimizing dimensionality in various applications. The method is available in the Python library DADApy.
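The weighting idea in the abstract can be illustrated with a small NumPy sketch. It is not the DADApy implementation and does not use the DADApy API. The loss below assumes a softmax-weighted form, DII = (2/N^2) * sum_ij c_ij * r_ij, where r_ij is the neighbour rank of point j relative to point i in the ground truth space and c_ij is a softmax over negative squared distances in the weighted feature space. This form, the fixed smoothing parameter `lam`, the toy data, and all function names are assumptions made for illustration. The gradient is approximated by central finite differences to keep the sketch dependency-free, which stands in for an analytic gradient.

```python
import numpy as np

def sq_dists(X, w):
    """Pairwise squared Euclidean distances after scaling each feature by its weight."""
    Z = X * w
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

def neighbour_ranks(X):
    """ranks[i, j] = rank of j among the neighbours of i (1 = nearest, self = 0)."""
    d = sq_dists(X, np.ones(X.shape[1]))
    np.fill_diagonal(d, -1.0)  # each point sorts first in its own row
    return d.argsort(axis=1).argsort(axis=1)

def dii(w, X, ranks, lam):
    """Assumed softmax-weighted DII from the weighted space of X to the target ranks."""
    d = sq_dists(X, w)
    np.fill_diagonal(d, np.inf)  # exclude self from the softmax
    logits = -d / lam
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    c = np.exp(logits)
    c /= c.sum(axis=1, keepdims=True)
    n = X.shape[0]
    return 2.0 * (c * ranks).sum() / n**2

def optimize_weights(X, ranks, lam, lr=0.05, steps=100, eps=1e-4):
    """Gradient descent on the weights; returns the best weights seen and their DII."""
    w = np.ones(X.shape[1])
    best_w, best_val = w.copy(), dii(w, X, ranks, lam)
    for _ in range(steps):
        grad = np.zeros_like(w)
        for k in range(w.size):  # central finite differences (illustrative only)
            e = np.zeros_like(w)
            e[k] = eps
            grad[k] = (dii(w + e, X, ranks, lam) - dii(w - e, X, ranks, lam)) / (2 * eps)
        w = np.clip(w - lr * grad, 0.0, None)  # keep weights non-negative
        val = dii(w, X, ranks, lam)
        if val < best_val:
            best_w, best_val = w.copy(), val
    return best_w, best_val

# Toy data (hypothetical): two informative features plus one unrelated, larger-scale feature.
rng = np.random.default_rng(0)
informative = rng.uniform(size=(100, 2))
noise = 3.0 * rng.uniform(size=(100, 1))
X = np.hstack([informative, noise])

ranks = neighbour_ranks(informative)  # ground truth space: the informative features
lam = 0.01                            # fixed smoothing scale, chosen by hand

dii_uniform = dii(np.ones(3), X, ranks, lam)
dii_truth = dii(np.array([1.0, 1.0, 0.0]), X, ranks, lam)
dii_noise = dii(np.array([0.0, 0.0, 1.0]), X, ranks, lam)
w_opt, dii_opt = optimize_weights(X, ranks, lam)

print("uniform weights:", dii_uniform)
print("informative only:", dii_truth)
print("noise only:", dii_noise)
print("optimized weights:", w_opt, "DII:", dii_opt)
```

A low value means that neighbours in the weighted space are also close neighbours in the ground truth space, so weights that suppress the unrelated feature should score better than weights that emphasize it. The optimizer keeps the best weights it encounters, so its result is never worse than the uniform starting point.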
Related papers
- Gram-Schmidt Methods for Unsupervised Feature Extraction and Selection [7.373617024876725]
We propose a Gram-Schmidt process over function spaces to detect and map out nonlinear dependencies.
We provide experimental results for synthetic and real-world benchmark datasets.
Surprisingly, our linear feature extraction algorithms are comparable to, and often outperform, several important nonlinear feature extraction methods.
arXiv Detail & Related papers (2023-11-15T21:29:57Z) - Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z) - Measuring dissimilarity with diffeomorphism invariance [94.02751799024684]
We introduce DID, a pairwise dissimilarity measure applicable to a wide range of data spaces.
We prove that DID enjoys properties which make it relevant for theoretical study and practical use.
arXiv Detail & Related papers (2022-02-11T13:51:30Z) - Feature Weighted Non-negative Matrix Factorization [92.45013716097753]
We propose the Feature weighted Non-negative Matrix Factorization (FNMF) in this paper.
FNMF learns the weights of features adaptively according to their importances.
It can be solved efficiently with the suggested optimization algorithm.
arXiv Detail & Related papers (2021-03-24T21:17:17Z) - Feature Selection Using Reinforcement Learning [0.0]
The space of variables or features that can be used to characterize a particular predictor of interest continues to grow exponentially.
Identifying the most characterizing features that minimize the variance without jeopardizing the bias of our models is critical to successfully training a machine learning model.
arXiv Detail & Related papers (2021-01-23T09:24:37Z) - Robust Multi-class Feature Selection via $l_{2,0}$-Norm Regularization Minimization [6.41804410246642]
Feature selection is an important computational processing step in data mining and machine learning.
In this paper, a novel method based on homotopy iterative hard threshold (HIHT) is proposed to solve the least squares problem for multi-class feature selection.
arXiv Detail & Related papers (2020-10-08T02:06:06Z) - The role of feature space in atomistic learning [62.997667081978825]
Physically-inspired descriptors play a key role in the application of machine-learning techniques to atomistic simulations.
We introduce a framework to compare different sets of descriptors, and different ways of transforming them by means of metrics and kernels.
We compare representations built in terms of n-body correlations of the atom density, quantitatively assessing the information loss associated with the use of low-order features.
arXiv Detail & Related papers (2020-09-06T14:12:09Z) - A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z) - METASET: Exploring Shape and Property Spaces for Data-Driven Metamaterials Design [20.272835126269374]
We show that a smaller yet diverse set of unit cells leads to scalable search and unbiased learning.
Our flexible method can distill unique subsets regardless of the metric employed.
Our diverse subsets are provided publicly for use by any designer.
arXiv Detail & Related papers (2020-06-01T03:36:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.