Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization
- URL: http://arxiv.org/abs/2510.14217v1
- Date: Thu, 16 Oct 2025 01:52:26 GMT
- Title: Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization
- Authors: Asma Jamali, Tin Sum Cheng, Rodrigo A. Vargas-Hernández
- Abstract summary: We provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset. Surprisingly, richer spectral features, measured by four different spectral metrics, do not consistently improve accuracy. For transformer-based and local 3D representations, spectral richness can even have a negative correlation with performance.
- Score: 3.2880869992413246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the spectral properties of kernels offers a principled perspective on generalization and representation quality. While deep models achieve state-of-the-art accuracy in molecular property prediction, kernel methods remain widely used for their robustness in low-data regimes and transparent theoretical grounding. Despite extensive studies of kernel spectra in machine learning, systematic spectral analyses of molecular kernels are scarce. In this work, we provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset, covering molecular fingerprint, pretrained transformer-based, and global and local 3D representations across seven molecular properties. Surprisingly, richer spectral features, measured by four different spectral metrics, do not consistently improve accuracy. Pearson correlation tests further reveal that for transformer-based and local 3D representations, spectral richness can even have a negative correlation with performance. We also implement truncated kernels to probe the relationship between spectrum and predictive performance: in many kernels, retaining only the top 2% of eigenvalues recovers nearly all performance, indicating that the leading eigenvalues capture the most informative features. Our results challenge the common heuristic that "richer spectra yield better generalization" and highlight nuanced relationships between representation, kernel features, and predictive performance. Beyond molecular property prediction, these findings inform how kernel and self-supervised learning methods are evaluated in data-limited scientific and real-world tasks.
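The truncated-kernel probe described in the abstract can be sketched in a few lines of NumPy. This is an illustrative example only, not the authors' code: it uses an RBF kernel on synthetic data as a stand-in for the molecular kernels, eigendecomposes the kernel matrix, rebuilds it from the top 2% of eigenvalues, and fits kernel ridge regression with both the full and the truncated kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a molecular representation: n samples, d features.
X = rng.normal(size=(200, 16))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between row sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = rbf_kernel(X, X)

# Eigendecomposition of the symmetric kernel matrix, sorted descending.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Truncated kernel: keep only the top 2% of eigenvalues.
k = max(1, int(0.02 * len(eigvals)))
K_trunc = (eigvecs[:, :k] * eigvals[:k]) @ eigvecs[:, :k].T

# Kernel ridge regression with the full vs. truncated kernel.
lam = 1e-3
alpha_full = np.linalg.solve(K + lam * np.eye(len(y)), y)
alpha_trunc = np.linalg.solve(K_trunc + lam * np.eye(len(y)), y)

pred_full = K @ alpha_full
pred_trunc = K_trunc @ alpha_trunc
print("full-kernel train MSE:     ", np.mean((pred_full - y) ** 2))
print("truncated-kernel train MSE:", np.mean((pred_trunc - y) ** 2))
```

In the paper's setting, comparing held-out errors of the full and truncated fits shows how much predictive signal the leading eigenvalues carry; the specific kernel, regularizer `lam`, and representation above are placeholder choices.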
Related papers
- Spectral Geometry for Deep Learning: Compression and Hallucination Detection via Random Matrix Theory [0.0]
This thesis proposes a unified framework based on spectral geometry and random matrix theory to address both problems. The first contribution, EigenTrack, is a real-time method for detecting hallucinations and out-of-distribution behavior in language and vision-language models. The second contribution, RMT-KD, is a principled compression method that identifies informative spectral components.
arXiv Detail & Related papers (2026-01-24T08:07:22Z) - SPECTRA: Spectral Target-Aware Graph Augmentation for Imbalanced Molecular Property Regression [45.62053904749856]
SPECTRA is a Spectral Target-Aware graph augmentation framework. It generates realistic molecular graphs in the spectral domain. It consistently reduces error in relevant target ranges while maintaining competitive overall MAE.
arXiv Detail & Related papers (2025-11-06T21:57:21Z) - DiffSpectra: Molecular Structure Elucidation from Spectra using Diffusion Models [66.41802970528133]
Molecular structure elucidation from spectra is a foundational problem in chemistry. Traditional methods rely heavily on expert interpretation and lack scalability. We present DiffSpectra, a generative framework that directly infers both 2D and 3D molecular structures from multi-modal spectral data.
arXiv Detail & Related papers (2025-07-09T13:57:20Z) - Graph Neural Networks Are More Than Filters: Revisiting and Benchmarking from A Spectral Perspective [49.613774305350084]
Graph Neural Networks (GNNs) have achieved remarkable success in various graph-based learning tasks. Recent studies suggest that other components such as non-linear layers may also significantly affect how GNNs process the input graph data in the spectral domain. This paper introduces a comprehensive benchmark to measure and evaluate GNNs' capability in capturing and leveraging the information encoded in different frequency components of the input graph data.
arXiv Detail & Related papers (2024-12-10T04:53:53Z) - Holistic Physics Solver: Learning PDEs in a Unified Spectral-Physical Space [54.13671100638092]
Holistic Physics Mixer (HPM) is a framework for integrating spectral and physical information in a unified space. We show that HPM consistently outperforms state-of-the-art methods in both accuracy and computational efficiency.
arXiv Detail & Related papers (2024-10-15T08:19:39Z) - Demystifying Spectral Bias on Real-World Data [2.3020018305241337]
Kernel ridge regression (KRR) and Gaussian processes (GPs) are fundamental tools in statistics and machine learning. We consider cross-dataset learnability and show that one may use eigenvalues and eigenfunctions associated with highly idealized data measures to reveal spectral bias on complex datasets.
arXiv Detail & Related papers (2024-06-04T18:00:00Z) - Improving Expressive Power of Spectral Graph Neural Networks with Eigenvalue Correction [55.57072563835959]
We propose an eigenvalue correction strategy that can free filters from the constraints of repeated eigenvalue inputs. Concretely, the proposed eigenvalue correction strategy enhances the uniform distribution of eigenvalues, and improves the fitting capacity and expressive power of filters.
arXiv Detail & Related papers (2024-01-28T08:12:00Z) - Hodge-Aware Contrastive Learning [101.56637264703058]
Simplicial complexes prove effective in modeling data with multiway dependencies.
We develop a contrastive self-supervised learning approach for processing simplicial data.
arXiv Detail & Related papers (2023-09-14T00:40:07Z) - Unsupervised Machine Learning for Exploratory Data Analysis of Exoplanet Transmission Spectra [68.8204255655161]
We focus on unsupervised techniques for analyzing spectral data from transiting exoplanets.
We show that there is a high degree of correlation in the spectral data, which calls for appropriate low-dimensional representations.
We uncover interesting structures in the principal component basis, namely, well-defined branches corresponding to different chemical regimes.
arXiv Detail & Related papers (2022-01-07T22:26:33Z) - Spectral Analysis Network for Deep Representation Learning and Image Clustering [53.415803942270685]
This paper proposes a new network structure for unsupervised deep representation learning based on spectral analysis.
It can identify the local similarities among images in patch level and thus more robust against occlusion.
It can learn more clustering-friendly representations and is capable to reveal the deep correlations among data samples.
arXiv Detail & Related papers (2020-09-11T05:07:15Z) - Convolutional Spectral Kernel Learning [21.595130250234646]
We build an interpretable convolutional spectral kernel network (CSKN) based on the inverse Fourier transform.
We derive the generalization error bounds and introduce two regularizers to improve the performance.
Experimental results on real-world datasets validate the effectiveness of the learning framework.
arXiv Detail & Related papers (2020-02-28T14:35:54Z) - Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks [17.188280334580195]
We derive analytical expressions for the generalization performance of kernel regression as a function of the number of training samples.
Our expressions apply to wide neural networks due to an equivalence between training them and kernel regression with the Neural Tangent Kernel (NTK).
We verify our theory with simulations on synthetic data and MNIST dataset.
arXiv Detail & Related papers (2020-02-07T00:03:40Z) - Persistent spectral based machine learning (PerSpect ML) for drug design [0.0]
We propose persistent spectral based machine learning (PerSpect ML) models for drug design.
We consider 11 persistent spectral variables and use them as the feature for machine learning models in protein-ligand binding affinity prediction.
Our results on all these databases outperform all existing models, to the best of our knowledge.
arXiv Detail & Related papers (2020-02-03T07:14:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.