SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images
- URL: http://arxiv.org/abs/2407.18338v1
- Date: Thu, 25 Jul 2024 18:52:10 GMT
- Title: SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images
- Authors: Ching Ting Leung, Yufan Chen, Hanyu Gao
- Abstract summary: We present a dataset designed to benchmark machine recognition capabilities of chemical molecules with arrow-pushing annotations.
This dataset includes a machine-readable molecular identity for each image as well as mechanistic arrows showing electron flow during chemical reactions.
- Score: 0.8192907805418583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optical chemical structure recognition (OCSR) systems aim to extract molecular structure information, usually in the form of a molecular graph or SMILES string, from images of chemical molecules. While many tools have been developed for this purpose, challenges remain because of the different types of noise that can appear in the images. Specifically, we focus on 'arrow-pushing' diagrams, a common type of chemical image used to show electron flow in mechanistic steps. We present Structural molecular identifier of Molecular images in Chemical Reaction Mechanisms (SMiCRM), a dataset designed to benchmark machine recognition capabilities of chemical molecules with arrow-pushing annotations. Comprising 453 images, it spans a broad array of organic chemical reactions, each illustrated with molecular structures and mechanistic arrows. SMiCRM offers a rich collection of annotated molecule images for benchmarking OCSR methods. The dataset includes a machine-readable molecular identity for each image as well as mechanistic arrows showing electron flow during chemical reactions. It presents a more authentic and challenging task for testing molecular recognition technologies, and solving this task can greatly enrich the mechanistic information in computer-extracted chemical reaction data.
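To make the benchmarking idea in the abstract concrete, here is a minimal sketch of how one annotated image and a simple accuracy score might be represented. The field names (`image_path`, `smiles`, `arrows`), the arrow encoding as tail and head pixel coordinates, and the file names are all hypothetical illustrations, not the dataset's actual schema. The score is a plain string comparison; a real evaluation would likely canonicalize SMILES with a cheminformatics toolkit first, since one molecule can be written as many different SMILES strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Arrow:
    # Hypothetical encoding: pixel coordinates of the arrow tail (electron
    # source) and arrow head (electron sink) in the image.
    tail: Tuple[float, float]
    head: Tuple[float, float]


@dataclass
class Record:
    # Hypothetical record layout: one image, its molecular identity, and
    # any mechanistic arrows drawn on it.
    image_path: str
    smiles: str
    arrows: List[Arrow] = field(default_factory=list)


def exact_match_accuracy(records: List[Record], predictions: Dict[str, str]) -> float:
    """Fraction of records whose predicted SMILES equals the reference string.

    Naive string equality only; no SMILES canonicalization is performed.
    A missing prediction counts as wrong.
    """
    if not records:
        return 0.0
    hits = sum(1 for r in records if predictions.get(r.image_path) == r.smiles)
    return hits / len(records)


# Made-up example data for illustration only.
records = [
    Record("img_001.png", "CC(=O)O", [Arrow((10.0, 20.0), (30.0, 25.0))]),
    Record("img_002.png", "c1ccccc1"),
]
predictions = {"img_001.png": "CC(=O)O", "img_002.png": "C1CCCCC1"}
accuracy = exact_match_accuracy(records, predictions)
```

In this made-up example the second prediction (cyclohexane) differs from the reference (benzene), so one of the two records matches.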
Related papers
- SubGrapher: Visual Fingerprinting of Chemical Structures [46.677062201188015]
SubGrapher is a method for the visual fingerprinting of chemical structure images.
Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images.
Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions.
arXiv Detail & Related papers (2025-04-28T11:45:46Z)
- Knowledge-aware contrastive heterogeneous molecular graph learning [77.94721384862699]
We propose a paradigm shift by encoding molecular graphs into Knowledge-aware Contrastive Heterogeneous Molecular Graph Learning (KCHML).
KCHML conceptualizes molecules through three distinct graph views (molecular, elemental, and pharmacological), enhanced by heterogeneous molecular graphs and a dual message-passing mechanism.
This design offers a comprehensive representation for property prediction, as well as for downstream tasks such as drug-drug interaction (DDI) prediction.
arXiv Detail & Related papers (2025-02-17T11:53:58Z)
- MOL-Mamba: Enhancing Molecular Representation with Structural & Electronic Insights [23.55889965960128]
We introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights.
MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets.
arXiv Detail & Related papers (2024-12-21T04:48:57Z)
- Learning Chemical Reaction Representation with Reactant-Product Alignment [50.28123475356234]
RAlign is a novel chemical reaction representation learning model for various organic reaction-related tasks.
By integrating atomic correspondence between reactants and products, our model discerns the molecular transformations that occur during the reaction.
We introduce a reaction-center-aware attention mechanism that enables the model to concentrate on key functional groups.
arXiv Detail & Related papers (2024-11-26T17:41:44Z)
- GraphXForm: Graph transformer for computer-aided molecular design with application to extraction [73.1842164721868]
We present GraphXForm, a decoder-only graph transformer architecture, which is pretrained on existing compounds and then fine-tuned.
We evaluate it on two solvent design tasks for liquid-liquid extraction, showing that it outperforms four state-of-the-art molecular design techniques.
arXiv Detail & Related papers (2024-11-03T19:45:15Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- Advancing Molecular Machine (Learned) Representations with Stereoelectronics-Infused Molecular Graphs [0.0]
We introduce a novel approach to infusing quantum-chemical-rich information into molecular graphs via stereoelectronic effects.
We show that the explicit addition of stereoelectronic interactions significantly improves the performance of molecular machine learning models.
We also show that the learned representations allow for facile stereoelectronic evaluation of previously intractable systems.
arXiv Detail & Related papers (2024-08-08T15:21:07Z)
- Atom-Level Optical Chemical Structure Recognition with Limited Supervision [14.487346160322653]
We propose a new chemical structure recognition tool that delivers state-of-the-art performance.
Unlike previous approaches, our method provides atom-level localization.
Our model is the first model to perform OCSR with atom-level entity detection with only SMILES supervision.
arXiv Detail & Related papers (2024-04-02T09:01:21Z)
- Expanding Chemical Representation with k-mers and Fragment-based Fingerprints for Molecular Fingerprinting [4.588028371034407]
This study introduces a novel approach, combining substructure counting, $k$-mers, and Daylight-like fingerprints, to expand the representation of chemical structures in SMILES strings.
The integrated method generates comprehensive molecular embeddings that enhance discriminative power and information content.
arXiv Detail & Related papers (2024-03-28T21:36:07Z)
- From molecules to scaffolds to functional groups: building context-dependent molecular representation via multi-channel learning [10.025809630976065]
This paper introduces a novel pre-training framework that learns robust and generalizable chemical knowledge.
Our approach demonstrates competitive performance across various molecular property benchmarks.
arXiv Detail & Related papers (2023-11-05T23:47:52Z)
- MolGrapher: Graph-based Visual Recognition of Chemical Structures [50.13749978547401]
We introduce MolGrapher to recognize chemical structures visually.
We treat all candidate atoms and bonds as nodes and put them in a graph.
We classify atom and bond nodes in the graph with a Graph Neural Network.
arXiv Detail & Related papers (2023-08-23T16:16:11Z)
- An Equivariant Generative Framework for Molecular Graph-Structure Co-Design [54.92529253182004]
We present MolCode, a machine learning-based generative framework for Molecular graph-structure Co-design.
In MolCode, 3D geometric information empowers the molecular 2D graph generation, which in turn helps guide the prediction of molecular 3D structure.
Our investigation reveals that the 2D topology and 3D geometry contain intrinsically complementary information in molecule design.
arXiv Detail & Related papers (2023-04-12T13:34:22Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Graph-based Molecular Representation Learning [59.06193431883431]
Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science.
Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning.
arXiv Detail & Related papers (2022-07-08T17:43:20Z)
- IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System [29.946393284884778]
We introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and encoder-decoder Transformer layers for molecule description generation.
IMG2SMI outperforms OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity.
We also release a new molecule prediction dataset including 81 million molecules for molecule description generation.
arXiv Detail & Related papers (2021-09-03T19:57:07Z)
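The IMG2SMI entry above reports results as a fingerprint Tanimoto similarity. As a rough illustration of that metric only, the sketch below computes the Tanimoto coefficient between two fingerprints represented as sets of "on" bit indices. The bit indices are made up and are not real MACCS keys; in practice the fingerprints would come from a cheminformatics toolkit. Returning 1.0 for two empty fingerprints is a convention chosen here, not something stated in the source.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of 'on' bits."""
    if not a and not b:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


# Made-up bit indices standing in for predicted and reference fingerprints.
fp_predicted = {3, 17, 42, 99, 120}
fp_reference = {3, 17, 42, 101, 120, 150}
score = tanimoto(fp_predicted, fp_reference)  # 4 shared bits out of 7 distinct bits
```

A score of 1.0 means the two fingerprints set exactly the same bits, and 0.0 means they share none, so the metric gives partial credit to predictions that are structurally close but not identical.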
This list is automatically generated from the titles and abstracts of the papers in this site.