Related papers: Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints

URL: http://arxiv.org/abs/2403.17954v1
Date: Sun, 10 Mar 2024 16:49:04 GMT
Title: Sort & Slice: A Simple and Superior Alternative to Hash-Based Folding for Extended-Connectivity Fingerprints
Authors: Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris
Abstract summary: We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling. Sort & Slice is an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures.
Score: 0.873811641236639
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods; in contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure-selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the $L$ most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, $L$. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated methods across prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
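
The two pooling schemes contrasted in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes each compound has already been reduced to a set of integer ECFP substructure identifiers (for instance by a cheminformatics toolkit), and the function names `fit_sort_and_slice` and `hash_fold`, as well as the tie-breaking rule by identifier, are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def fit_sort_and_slice(
    train_substructures: Iterable[Set[int]], length: int
) -> Callable[[Set[int]], List[int]]:
    """Learn a Sort & Slice featuriser from training compounds.

    Each training compound is given as a set of integer substructure
    identifiers (hypothetical input format for this sketch).
    """
    # Count in how many training compounds each substructure occurs.
    counts: Counter = Counter()
    for subs in train_substructures:
        counts.update(set(subs))

    # Sort by decreasing prevalence. Ties are broken by identifier so the
    # result is deterministic (an assumption of this sketch).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))

    # Slice: keep only the `length` most frequent substructures, one
    # vector position per kept substructure, so no collisions can occur.
    position = {s: i for i, s in enumerate(ranked[:length])}

    def featurise(subs: Set[int]) -> List[int]:
        vec = [0] * length
        for s in subs:
            i = position.get(s)
            if i is not None:  # substructures outside the slice are dropped
                vec[i] = 1
        return vec

    return featurise


def hash_fold(subs: Set[int], length: int) -> List[int]:
    """Hash-based folding: map each identifier to a position by modulo."""
    vec = [0] * length
    for s in subs:
        vec[s % length] = 1  # distinct substructures may share a position
    return vec


train = [{1, 2, 3}, {1, 2}, {1, 5}]
featurise = fit_sort_and_slice(train, length=2)
fp_sliced = featurise({2, 5})
fp_folded = hash_fold({1, 5}, length=2)
```

In this toy example, substructures 1 and 5 collide under folding because both map to position 1, whereas the Sort & Slice featuriser assigns positions only to the two most prevalent training substructures (1 and 2) and ignores the rest.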

Related papers

PRISM: Parallel Residual Iterative Sequence Model [52.26239951489612]
We propose PRISM (Parallel Residual Iterative Sequence Model) to resolve this tension. PRISM introduces a solver-inspired inductive bias that captures key structural properties of multi-step refinement in a parallelizable form. We prove that this formulation achieves Rank-$L$ accumulation, structurally expanding the update manifold beyond the single-step Rank-$1$ bottleneck.
arXiv Detail & Related papers (2026-02-11T12:39:41Z)
Learning Discrete Bayesian Networks with Hierarchical Dirichlet Shrinkage [52.914168158222765]
We detail a comprehensive Bayesian framework for learning DBNs. We give a novel Markov chain Monte Carlo (MCMC) algorithm utilizing parallel Langevin proposals to generate exact posterior samples. We apply our methodology to uncover prognostic network structure from primary breast cancer samples.
arXiv Detail & Related papers (2025-09-16T17:24:35Z)
Piecewise Linear Approximation in Learned Index Structures: Theoretical and Empirical Analysis [16.350750984598797]
Piecewise Linear Approximation ($\epsilon$-PLA) has emerged as a popular choice due to its simplicity and effectiveness. Despite its central role in many learned indexes, the design and analysis of $\epsilon$-PLA fitting algorithms remain underexplored.
arXiv Detail & Related papers (2025-06-25T05:20:54Z)
Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table [60.78615287040791]
XAStruct is a learning framework capable of both predicting XAS spectra from crystal structures and inferring local structural descriptors from XAS input. XAStruct is trained on a large-scale dataset spanning over 70 elements across the periodic table.
arXiv Detail & Related papers (2025-06-13T15:58:05Z)
Scalable Substructure Discovery Algorithm For Homogeneous Multilayer Networks [2.941253902145271]
Graph mining analyzes real-world graphs to find core substructures (connected subgraphs) in applications modeled as graphs. Substructure discovery is a process that involves identifying meaningful patterns, structures, or components within a large data set. This paper focuses on substructure discovery in homogeneous multilayer networks (one type of MLN) using a novel decoupling-based approach.
arXiv Detail & Related papers (2025-04-27T18:58:32Z)
Learning Structure-enhanced Temporal Point Processes with Gromov-Wasserstein Regularization [31.23290588877332]
We learn structure-enhanced TPPs with the help of Gromov-Wasserstein (GW) regularization. In large-scale applications, we sample the kernel matrix and implement the regularization as a Gromov-Wasserstein (GW) discrepancy term. The TPPs learned through this method result in clustered sequence embeddings and demonstrate competitive predictive and clustering performance.
arXiv Detail & Related papers (2025-03-29T07:47:21Z)
Accelerating spherical K-means clustering for large-scale sparse document data [0.7366405857677226]
This paper presents an accelerated spherical K-means clustering algorithm for large-scale and high-dimensional sparse document data sets. We experimentally demonstrate that our algorithm efficiently achieves superior speed performance in large-scale documents compared with algorithms using the state-of-the-art techniques.
arXiv Detail & Related papers (2024-11-18T05:50:58Z)
Faster Learned Sparse Retrieval with Block-Max Pruning [11.080810272211906]
This paper introduces Block-Max Pruning (BMP), an innovative dynamic pruning strategy tailored for indexes arising in learned sparse retrieval environments. BMP substantially outperforms existing dynamic pruning strategies, offering unparalleled efficiency in safe retrieval contexts.
arXiv Detail & Related papers (2024-05-02T09:26:30Z)
Compact Neural Graphics Primitives with Learned Hash Probing [100.07267906666293]
We show that a hash table with learned probes has neither disadvantage, resulting in a favorable combination of size and speed. Inference is faster than unprobed hash tables at equal quality while training is only 1.2-2.6x slower.
arXiv Detail & Related papers (2023-12-28T18:58:45Z)
TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers [56.56591337457137]
We propose an accurate and end-to-end transformer-based table structure recognition method, referred to as TRUST. Transformers are suitable for table structure recognition because of their global computations, perfect memory, and parallel computation. We conduct experiments on several popular benchmarks, including PubTabNet and SynthTable; our method achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-08-31T08:33:36Z)
Multi-Modal Association based Grouping for Form Structure Extraction [14.134131448981295]
We present a novel multi-modal approach for form structure extraction. We extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups. Our approach achieves a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively.
arXiv Detail & Related papers (2021-07-09T12:49:34Z)
Dynamic Probabilistic Pruning: A general framework for hardware-constrained pruning at different granularities [80.06422693778141]
We propose a flexible new pruning mechanism that facilitates pruning at different granularities (weights, kernels, filters/feature maps). We refer to this algorithm as Dynamic Probabilistic Pruning (DPP). We show that DPP achieves competitive compression rates and classification accuracy when pruning common deep learning models trained on different benchmark datasets for image classification.
arXiv Detail & Related papers (2021-05-26T17:01:52Z)
HPNet: Deep Primitive Segmentation Using Hybrid Representations [51.56523135057311]
HPNet is a novel deep-learning approach for segmenting a 3D shape represented as a point cloud into primitive patches. Rather than relying on a single feature representation, HPNet uses hybrid representations that combine one learned semantic descriptor, two spectral descriptors derived from predicted parameters, and an adjacency matrix that encodes sharp edges.
arXiv Detail & Related papers (2021-05-22T02:12:46Z)
Partitioned hybrid learning of Bayesian network structures [6.683105697884667]
We develop a novel hybrid method for Bayesian network structure learning called partitioned hybrid greedy search (pHGS). Our empirical results demonstrate the superior performance of pHGS against many state-of-the-art structure learning algorithms.
arXiv Detail & Related papers (2021-03-22T21:34:52Z)
Oblique Predictive Clustering Trees [6.317966126631351]
Predictive clustering trees (PCTs) can be used to solve a variety of predictive modeling tasks, including structured output prediction. We propose oblique predictive clustering trees, capable of addressing these limitations. We experimentally evaluate the proposed methods on 60 benchmark datasets for 6 predictive modeling tasks.
arXiv Detail & Related papers (2020-07-27T14:58:23Z)
Torch-Struct: Deep Structured Prediction Library [138.5262350501951]
We introduce Torch-Struct, a library for structured prediction. Torch-Struct includes a broad collection of probabilistic structures accessed through a simple and flexible distribution-based API.
arXiv Detail & Related papers (2020-02-03T16:43:02Z)
Supervised Learning for Non-Sequential Data: A Canonical Polyadic Decomposition Approach [85.12934750565971]
Efficient modelling of feature interactions underpins supervised learning for non-sequential tasks. To alleviate this issue, it has been proposed to implicitly represent the model parameters as a tensor. For enhanced expressiveness, we generalize the framework to allow feature mapping to arbitrarily high-dimensional feature vectors.
arXiv Detail & Related papers (2020-01-27T22:38:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.