Random projections and Kernelised Leave One Cluster Out
Cross-Validation: Universal baselines and evaluation tools for supervised
machine learning for materials properties
- URL: http://arxiv.org/abs/2206.08841v1
- Date: Fri, 17 Jun 2022 15:39:39 GMT
- Authors: Samantha Durdy, Michael Gaultois, Vladimir Gusev, Danushka Bollegala
and Matthew J. Rosseinsky
- Abstract summary: Leave one cluster out cross validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials.
We present a thorough comparison between composition-based representations, and investigate how kernel approximation functions can be used to enhance LOCO-CV applications.
We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With machine learning being a popular topic in current computational
materials science literature, creating representations for compounds has become
commonplace. These representations are rarely compared, as evaluating their
performance - and the performance of the algorithms that they are used with -
is non-trivial. With many materials datasets containing bias and skew caused by
the research process, leave one cluster out cross validation (LOCO-CV) has been
introduced as a way of measuring the performance of an algorithm in predicting
previously unseen groups of materials. This raises the question of the impact,
and control, of the range of cluster sizes on the LOCO-CV measurement outcomes.
We present a thorough comparison between composition-based representations, and
investigate how kernel approximation functions can be used to better separate
data to enhance LOCO-CV applications.
We find that domain knowledge does not improve machine learning performance
in most tasks tested, with band gap prediction being the notable exception. We
also find that the radial basis function improves the linear separability of
chemical datasets in all 10 datasets tested and provide a framework for the
application of this function in the LOCO-CV process to improve the outcome of
LOCO-CV measurements regardless of machine learning algorithm, choice of
metric, and choice of compound representation. We recommend kernelised LOCO-CV
as a training paradigm for those looking to measure the extrapolatory power of
an algorithm on materials data.
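The kernelised LOCO-CV procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn, uses `RBFSampler` as the kernel approximation function, clusters with k-means, and evaluates a ridge regressor by mean absolute error. The function name `kernelised_loco_cv` and all parameter defaults are hypothetical choices made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut


def kernelised_loco_cv(X, y, n_clusters=5, gamma=0.1, n_components=100, seed=0):
    """Return cluster labels and one held-out-cluster MAE per cluster.

    Hypothetical helper: the kernel map is used only to assign clusters,
    while the model is trained on the original representation X.
    """
    # Approximate RBF feature map (assumed stand-in for the paper's
    # kernel approximation function).
    sampler = RBFSampler(gamma=gamma, n_components=n_components, random_state=seed)
    Z = sampler.fit_transform(X)

    # Cluster in the kernelised space; each cluster becomes one CV fold.
    groups = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)

    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # Train on every cluster except one, then test on the held-out cluster.
        model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
        scores.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return groups, np.array(scores)


if __name__ == "__main__":
    # Synthetic stand-in for a featurised materials dataset.
    rng = np.random.default_rng(0)
    X_demo = rng.normal(size=(200, 8))
    y_demo = X_demo @ rng.normal(size=8) + 0.1 * rng.normal(size=200)
    demo_groups, demo_scores = kernelised_loco_cv(X_demo, y_demo)
    print("cluster sizes:", np.bincount(demo_groups))
    print("per-cluster MAE:", demo_scores)
```

Printing the cluster sizes alongside the per-cluster error makes it easy to check how evenly the folds are sized, which is the concern the abstract raises about the range of cluster sizes.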
Related papers
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Composable Core-sets for Diversity Approximation on Multi-Dataset Streams [4.765131728094872]
Composable core-sets are core-sets with the property that subsets of the core set can be unioned together to obtain an approximation for the original data.
We introduce a core-set construction algorithm for constructing composable core-sets to summarize streamed data for use in active learning environments.
arXiv Detail & Related papers (2023-08-10T23:24:51Z)
- Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov Decision Processes (POMDPs).
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU).
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- Learning Operators with Coupled Attention [9.715465024071333]
We propose a novel operator learning method, LOCA, motivated from the recent success of the attention mechanism.
In our architecture the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations.
By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions.
arXiv Detail & Related papers (2022-01-04T08:22:03Z)
- Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
- DAC: Deep Autoencoder-based Clustering, a General Deep Learning Framework of Representation Learning [0.0]
We propose DAC, Deep Autoencoder-based Clustering, a data-driven framework to learn clustering representations using deep neural networks.
Experimental results show that our approach could effectively boost performance of the KMeans clustering algorithm on a variety of datasets.
arXiv Detail & Related papers (2021-02-15T11:31:00Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)