LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation
- URL: http://arxiv.org/abs/2510.26715v1
- Date: Thu, 30 Oct 2025 17:13:58 GMT
- Title: LSM-MS2: A Foundation Model Bridging Spectral Identification and Biological Interpretation
- Authors: Gabriel Asher, Devesh Shah, Amy A. Caudy, Luke Ferro, Lea Amar, Ana S. H. Costa, Thomas Patton, Niall O'Connor, Jennifer M. Campbell, Jack Geremia,
- Abstract summary: We present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds.
- Score: 0.179762320774136
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A vast majority of mass spectrometry data remains uncharacterized, leaving much of its biological and chemical information untapped. Recent advances in machine learning have begun to address this gap, particularly for tasks such as spectral identification in tandem mass spectrometry data. Here, we present the latest generation of LSM-MS2, a large-scale deep learning foundation model trained on millions of spectra to learn a semantic chemical space. LSM-MS2 achieves state-of-the-art performance in spectral identification, improving on existing methods by 30% in accuracy of identifying challenging isomeric compounds, yielding 42% more correct identifications in complex biological samples, and maintaining robustness under low-concentration conditions. Furthermore, LSM-MS2 produces rich spectral embeddings that enable direct biological interpretation from minimal downstream data, successfully differentiating disease states and predicting clinical outcomes across diverse translational applications.
Related papers
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear. We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures. Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z) - Comparative Analysis of Formula and Structure Prediction from Tandem Mass Spectra [3.2243643829769586]
Liquid chromatography mass spectrometry (LC-MS)-based metabolomics and exposomics aim to measure detectable small molecules in biological samples. Findings have established realistic performance baselines, identified critical bottlenecks, and provided guidance to further improve compound predictions based on MS.
arXiv Detail & Related papers (2026-01-02T16:20:13Z) - Unmasking Airborne Threats: Guided-Transformers for Portable Aerosol Mass Spectrometry [2.743898388459522]
Matrix Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI-MS) is a cornerstone in biomolecular analysis, offering precise identification of pathogens through unique mass spectral signatures. Yet, its reliance on labor-intensive sample preparation and multi-shot spectral averaging restricts its use to laboratory settings, rendering it impractical for real-time environmental monitoring. These limitations are especially pronounced in emerging aerosol MALDI-MS systems, where autonomous sampling generates noisy spectra for unknown aerosol analytes. We propose the Mass Spectral Dictionary-Guided Transformer (MS-DGFormer), a data-driven framework that redefines spectral
arXiv Detail & Related papers (2025-11-21T17:45:00Z) - Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra [60.08608779794957]
We propose GLMR, a Generative Language Model-based Retrieval framework. In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum. In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures.
arXiv Detail & Related papers (2025-11-09T07:25:53Z) - MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation [20.973121120131875]
Large-scale pretraining has proven effective in addressing data scarcity in other domains. We propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1.
arXiv Detail & Related papers (2025-10-23T14:45:28Z) - A Self-supervised Learning Method for Raman Spectroscopy based on Masked Autoencoders [3.9517125314802306]
We propose a self-supervised learning paradigm for Raman spectroscopy based on a Masked AutoEncoder, termed SMAE. SMAE does not require any spectral annotations during pre-training. By randomly masking and then reconstructing the spectral information, the model learns essential spectral features.
arXiv Detail & Related papers (2025-04-21T10:44:06Z) - DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
We present DiffMS, a formula-restricted encoder-decoder generative network that achieves state-of-the-art performance on this task. To develop a robust decoder that bridges latent embeddings and molecular structures, we pretrain the diffusion decoder with fingerprint-structure pairs. Experiments on established benchmarks show that DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z) - MADGEN: Mass-Spec attends to De Novo Molecular generation [16.89017809745962]
We propose a scaffold-based method for de novo molecular structure generation guided by mass spectrometry data. MADGEN operates in two stages: scaffold retrieval and spectra-conditioned molecular generation. We evaluate MADGEN on three datasets (NIST23, CANOPUS, and MassSpecGym)
arXiv Detail & Related papers (2025-01-03T18:54:26Z) - Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [55.74944165932666]
We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences. This dataset bridges large language models (LLMs) and complex biological sequence-related tasks, enhancing their versatility and reasoning. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction [44.27112553103388]
We present Molecule Caption Arena: the first comprehensive benchmark of large language model (LLM)-augmented molecular property prediction.
We evaluate over twenty LLMs, including both general-purpose and domain-specific molecule captioners, across diverse prediction tasks.
Our findings confirm the ability of LLM-extracted knowledge to enhance state-of-the-art molecular representations.
arXiv Detail & Related papers (2024-11-01T17:03:16Z) - ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z) - CLCLSA: Cross-omics Linked embedding with Contrastive Learning and Self Attention for multi-omics integration with incomplete multi-omics data [47.2764293508916]
Integration of heterogeneous and high-dimensional multi-omics data is becoming increasingly important in understanding genetic data.
One obstacle faced when performing multi-omics data integration is the existence of unpaired multi-omics data due to instrument sensitivity and cost.
We propose a deep learning method for multi-omics integration with incomplete data by Cross-omics Linked unified embedding with Contrastive Learning and Self Attention.
arXiv Detail & Related papers (2023-04-12T00:22:18Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.