Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints
- URL: http://arxiv.org/abs/2403.17954v1
- Date: Sun, 10 Mar 2024 16:49:04 GMT
- Title: Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints
- Authors: Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris
- Abstract summary: We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling.
Sort & Slice is an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures.
- Score: 0.873811641236639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods; in contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure-selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the $L$ most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, $L$. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated methods across prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
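The Sort & Slice procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes each compound has already been reduced to a set of integer substructure identifiers (for example, as produced by an ECFP generator). The function names, the tie-breaking rule (smaller identifier first), the zero-padding when fewer than L substructures exist, and the use of a plain modulo for hash-based folding are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List


def fit_sort_and_slice(train_sets: Iterable[Iterable[int]], L: int) -> List[int]:
    """Return the L substructure IDs that occur in the most training compounds."""
    counts: Counter = Counter()
    for substructures in train_sets:
        # Count each substructure at most once per compound (prevalence, not multiplicity).
        counts.update(set(substructures))
    # Sort by decreasing prevalence; ties are broken by ID (an assumption for determinism).
    ranked = sorted(counts, key=lambda sid: (-counts[sid], sid))
    # Slice away everything except the L most frequent substructures.
    return ranked[:L]


def sort_and_slice_fingerprint(substructures: Iterable[int],
                               vocabulary: List[int], L: int) -> List[int]:
    """Binary vector of length L: bit i is set if vocabulary[i] occurs in the compound."""
    present = set(substructures)
    bits = [0] * L  # stays zero-padded if the vocabulary has fewer than L entries
    for i, sid in enumerate(vocabulary[:L]):
        if sid in present:
            bits[i] = 1
    return bits


def folded_fingerprint(substructures: Iterable[int], L: int) -> List[int]:
    """Hash-based folding, simplified here to ID modulo L; distinct IDs can collide."""
    bits = [0] * L
    for sid in substructures:
        bits[sid % L] = 1
    return bits


# Toy data: four training compounds and one query compound (made-up identifiers).
train = [{101, 202, 303}, {101, 202}, {101, 404}, {505, 101}]
query = {202, 303, 999}

vocabulary = fit_sort_and_slice(train, L=3)
sns_bits = sort_and_slice_fingerprint(query, vocabulary, L=3)
folded_bits = folded_fingerprint(query, L=3)
```

In this toy example each Sort & Slice bit corresponds to exactly one training-set substructure, whereas under folding the identifiers 303 and 999 land on the same bit. Substructures outside the selected vocabulary (here 999) are simply ignored by Sort & Slice.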
Related papers
- Accelerating spherical K-means clustering for large-scale sparse document data [0.7366405857677226]
This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets.
We experimentally demonstrate that our algorithm is faster on large-scale document data than algorithms using state-of-the-art techniques.
arXiv Detail & Related papers (2024-11-18T05:50:58Z)
- Faster Learned Sparse Retrieval with Block-Max Pruning [11.080810272211906]
This paper introduces Block-Max Pruning (BMP), an innovative dynamic pruning strategy tailored for indexes arising in learned sparse retrieval environments.
BMP substantially outperforms existing dynamic pruning strategies, offering unparalleled efficiency in safe retrieval contexts.
arXiv Detail & Related papers (2024-05-02T09:26:30Z)
- Compact Neural Graphics Primitives with Learned Hash Probing [100.07267906666293]
We show that a hash table with learned probes has neither disadvantage, resulting in a favorable combination of size and speed.
Inference is faster than unprobed hash tables at equal quality while training is only 1.2-2.6x slower.
arXiv Detail & Related papers (2023-12-28T18:58:45Z)
- TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers [56.56591337457137]
We propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST.
Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation.
We conduct experiments on several popular benchmarks, including PubTabNet and SynthTable; our method achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-08-31T08:33:36Z)
- Multi-Modal Association based Grouping for Form Structure Extraction [14.134131448981295]
We present a novel multi-modal approach for form structure extraction.
We extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups.
Our approach achieves a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively.
arXiv Detail & Related papers (2021-07-09T12:49:34Z)
- Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps).
We refer to this algorithm as Dynamic Probabilistic Pruning (DPP).
We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
- HPNet: Deep Primitive Segmentation Using Hybrid Representations [51.56523135057311]
HPNet is a novel deep-learning approach for segmenting a 3D shape represented as a point cloud into primitive patches.
Rather than relying on a single feature representation, HPNet uses hybrid representations that combine one learned semantic descriptor, two spectral descriptors derived from predicted parameters, and an adjacency matrix that encodes sharp edges.
arXiv Detail & Related papers (2021-05-22T02:12:46Z)
- Partitioned hybrid learning of Bayesian network structures [6.683105697884667]
We develop a novel hybrid method for Bayesian network structure learning called partitioned hybrid greedy search (pHGS).
Our empirical results demonstrate the superior performance of pHGS against many state-of-the-art structure learning algorithms.
arXiv Detail & Related papers (2021-03-22T21:34:52Z)
- Oblique Predictive Clustering Trees [6.317966126631351]
Predictive clustering trees (PCTs) can be used to solve a variety of predictive modeling tasks, including structured output prediction.
We propose oblique predictive clustering trees, capable of addressing these limitations.
We experimentally evaluate the proposed methods on 60 benchmark datasets for 6 predictive modeling tasks.
arXiv Detail & Related papers (2020-07-27T14:58:23Z)
- Torch-Struct: Deep Structured Prediction Library [138.5262350501951]
We introduce Torch-Struct, a library for structured prediction.
Torch-Struct includes a broad collection of probabilistic structures accessed through a simple and flexible distribution-based API.
arXiv Detail & Related papers (2020-02-03T16:43:02Z)
- Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks.
To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor.
For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
arXiv Detail & Related papers (2020-01-27T22:38:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.