IMG2SMI: Translating Molecular Structure Images to Simplified
Molecular-input Line-entry System
- URL: http://arxiv.org/abs/2109.04202v1
- Date: Fri, 3 Sep 2021 19:57:07 GMT
- Title: IMG2SMI: Translating Molecular Structure Images to Simplified
Molecular-input Line-entry System
- Authors: Daniel Campos, Heng Ji
- Abstract summary: We introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and an encoder-decoder Transformer layers for molecule description generation.
IMG2SMI outperforms OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity.
We also release a new molecule prediction dataset including 81 million molecules for molecule description generation.
- Score: 29.946393284884778
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Like many scientific fields, new chemistry literature has grown at a
staggering pace, with thousands of papers released every month. A large portion
of chemistry literature focuses on new molecules and reactions between
molecules. Most vital information is conveyed through 2-D images of molecules,
representing the underlying molecules or reactions described. In order to
ensure reproducible and machine-readable molecule representations, text-based
molecule descriptors like SMILES and SELFIES were created. These text-based
molecule representations provide molecule generation but are unfortunately
rarely present in published literature. In the absence of molecule descriptors,
the generation of molecule descriptors from the 2-D images present in the
literature is necessary to understand chemistry literature at scale. Successful
methods such as Optical Structure Recognition Application (OSRA), and
ChemSchematicResolver are able to extract the locations of molecules structures
in chemistry papers and infer molecular descriptions and reactions. While
effective, existing systems expect chemists to correct outputs, making them
unsuitable for unsupervised large-scale data mining. Leveraging the task
formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a
model which leverages Deep Residual Networks for image feature extraction and
an encoder-decoder Transformer layers for molecule description generation.
Unlike previous Neural Network-based systems, IMG2SMI builds around the task of
molecule description generation, which enables IMG2SMI to outperform OSRA-based
systems by 163% in molecule similarity prediction as measured by the molecular
MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further
research on this task, we release a new molecule prediction dataset. including
81 million molecules for molecule description generation
Related papers
- Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model [55.87790704067848]
Mol-LLaMA is a large molecular language model that grasps the general knowledge centered on molecules.
We introduce a module that integrates complementary information from different molecular encoders.
Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules.
arXiv Detail & Related papers (2025-02-19T05:49:10Z) - Knowledge-aware contrastive heterogeneous molecular graph learning [77.94721384862699]
We propose a paradigm shift by encoding molecular graphs into Heterogeneous Molecular Graph Learning (KCHML)
KCHML conceptualizes molecules through three distinct graph views-molecular, elemental, and pharmacological-enhanced by heterogeneous molecular graphs and a dual message-passing mechanism.
This design offers a comprehensive representation for property prediction, as well as for downstream tasks such as drug-drug interaction (DDI) prediction.
arXiv Detail & Related papers (2025-02-17T11:53:58Z) - Chemical Language Model Linker: blending text and molecules with modular adapters [2.2667044928324747]
We propose a lightweight adapter-based strategy named Chemical Language Model Linker (ChemLML)
ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions.
We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance.
arXiv Detail & Related papers (2024-10-26T13:40:13Z) - UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation [35.51027934845928]
We introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture.
A Vector Quantization-driven tokenizer transforms molecules into sequences of molecule tokens with causal dependency.
UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks.
arXiv Detail & Related papers (2024-08-01T18:31:31Z) - SMiCRM: A Benchmark Dataset of Mechanistic Molecular Images [0.8192907805418583]
We present a dataset designed to benchmark machine recognition capabilities of chemical molecules with arrow-pushing annotations.
This dataset includes a machine-readable molecular identity for each image as well as mechanistic arrows showing electron flow during chemical reactions.
arXiv Detail & Related papers (2024-07-25T18:52:10Z) - MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text.
MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z) - A Molecular Multimodal Foundation Model Associating Molecule Graphs with
Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z) - Molecular Identification from AFM images using the IUPAC Nomenclature
and Attribute Multimodal Recurrent Neural Networks [0.0]
We present a strategy to address this challenging task using deep learning techniques.
Instead of identifying a finite number of molecules following a traditional classification approach, we define the molecular identification as an image captioning problem.
We design an architecture, composed of two multimodal recurrent neural networks, capable of identifying the structure and composition of an unknown molecule using a 3D-AFM image stack as input.
The neural network is trained to provide the name of each molecule according to the IUPAC nomenclature rules.
arXiv Detail & Related papers (2022-05-01T11:39:32Z) - Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning [68.8204255655161]
We introduce a novel framework for scalable 3D design that uses a hierarchical agent to build molecules.
In a variety of experiments, we show that our agent, guided only by energy considerations, can efficiently learn to produce molecules with over 100 atoms.
arXiv Detail & Related papers (2022-02-01T18:54:24Z) - Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist learning molecule representation.
Our approach is proven effective to 1) keep the embedding space well-organized and 2) improve the generalization ability of molecule embeddings.
Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
arXiv Detail & Related papers (2021-09-21T00:08:43Z) - MolCLR: Molecular Contrastive Learning of Representations via Graph
Neural Networks [11.994553575596228]
MolCLR is a self-supervised learning framework for large unlabeled molecule datasets.
We propose three novel molecule graph augmentations: atom masking, bond deletion, and subgraph removal.
Our method achieves state-of-the-art performance on many challenging datasets.
arXiv Detail & Related papers (2021-02-19T17:35:18Z) - Augmenting Molecular Images with Vector Representations as a
Featurization Technique for Drug Classification [4.873362301533825]
This paper proposes the creation of molecular images captioned with binary vectors that encode information not contained in or easily understood from a molecular image alone.
We tested our method on the HIV dataset published by the Pande lab, which consists of 41,127 molecules labeled by if they inhibit the HIV virus.
arXiv Detail & Related papers (2020-08-09T04:26:16Z) - ASGN: An Active Semi-supervised Graph Neural Network for Molecular
Property Prediction [61.33144688400446]
We propose a novel framework called Active Semi-supervised Graph Neural Network (ASGN) by incorporating both labeled and unlabeled molecules.
In the teacher model, we propose a novel semi-supervised learning method to learn general representation that jointly exploits information from molecular structure and molecular distribution.
At last, we proposed a novel active learning strategy in terms of molecular diversities to select informative data during the whole framework learning.
arXiv Detail & Related papers (2020-07-07T04:22:39Z) - Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning.
GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data.
We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.