Related papers: When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings

URL: http://arxiv.org/abs/2403.12984v2
Date: Wed, 27 Mar 2024 21:51:03 GMT
Title: When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings
Authors: Azmine Toushik Wasi, Šerbetar Karlo, Raima Islam, Taki Hasan Rafi, Dong-Kyu Chae
Abstract summary: Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of molecules and bonds. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types.
Score: 5.648318448953635
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of molecules and bonds. These SMILES strings are used in different complex machine learning-based drug-related research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? Our experiments affirm the possibility with very competitive scores. The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types, proving that complex problems can also be solved with simpler perspectives. The data and code are available here: https://github.com/azminewasi/Drug-Classification-NLP.
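The idea in the abstract, reading a SMILES string as a sentence whose words are atoms and bonds, can be illustrated with a minimal sketch. This is not the authors' pipeline (their code is at the repository linked above); the regex `SMILES_TOKEN` and the helpers `tokenize` and `ngram_counts` are hypothetical names introduced here, and the tokenizer is deliberately simplified. It only shows how a SMILES string could be turned into the bag-of-n-grams features that a conventional text classifier consumes.

```python
import re
from collections import Counter

# Hypothetical, simplified SMILES tokenizer (an assumption, not the paper's):
# bracket atoms such as [Na+], the two-letter halogens Br and Cl, and
# two-digit ring closures such as %10 are kept whole; every other
# character becomes its own "word".
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into atom, bond, and punctuation tokens."""
    return SMILES_TOKEN.findall(smiles)


def ngram_counts(tokens, n=2):
    """Count contiguous token n-grams, as in a bag-of-n-grams text model."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


# Aspirin, written as a SMILES "sentence".
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
words = tokenize(aspirin)
features = ngram_counts(words, n=2)

print(words)
print(features.most_common(3))
```

Feature dictionaries like `features` could then be vectorized and passed to any standard text classifier; which model and which n-gram range work best is an empirical question that this sketch does not address.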

Related papers

Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations [82.42811602081692]
This paper introduces a subsequence association framework to systematically trace and understand hallucinations. The key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts.
arXiv Detail & Related papers (2025-04-17T06:34:45Z)
RFL: Simplifying Chemical Structure Recognition with Ring-Free Language [66.47173094346115]
We propose a novel Ring-Free Language (RFL) to describe chemical structures in a hierarchical form. RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness. We propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings.
arXiv Detail & Related papers (2024-12-10T15:29:32Z)
Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models [58.952782707682815]
COFT is a novel method to focus on different-level key texts, thereby avoiding getting lost in lengthy contexts. Experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to superior performance by over 30% in the F1 score metric.
arXiv Detail & Related papers (2024-10-19T13:59:48Z)
DrugCLIP: Contrastive Drug-Disease Interaction For Drug Repurposing [4.969453745531116]
DrugCLIP is a contrastive learning method that learns drug-disease interactions without negative labels. We have curated a drug repurposing dataset based on real-world clinical trial records.
arXiv Detail & Related papers (2024-07-02T13:41:59Z)
Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem [58.3723958800254]
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, they are susceptible to producing unreliable conjectures, called hallucinations, in ambiguous contexts. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWP).
arXiv Detail & Related papers (2024-03-06T09:06:34Z)
Emerging Opportunities of Using Large Language Models for Translation Between Drug Molecules and Indications [6.832024637226738]
We propose a new task, which is the translation between drug molecules and corresponding indications. The creation of molecules from indications, or vice versa, will allow for more efficient targeting of diseases.
arXiv Detail & Related papers (2024-02-14T21:33:13Z)
Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs [16.508471997999496]
We investigate whether and how language models understand the chemical spatial structure from 1D sequences. The results indicate that language models can understand chemical structures from the perspective of molecular fragments.
arXiv Detail & Related papers (2024-01-15T12:53:58Z)
Compositional Representation of Polymorphic Crystalline Materials [56.80318252233511]
We introduce PCRL, a novel approach that employs probabilistic modeling of composition to capture the diverse polymorphs from available structural information. Extensive evaluations on sixteen datasets demonstrate the effectiveness of PCRL in learning compositional representation.
arXiv Detail & Related papers (2023-11-17T20:34:28Z)
Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. Deep learning models have emerged as an efficient way to discover synergistic combinations. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing [107.49804059269212]
We present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts.
arXiv Detail & Related papers (2022-12-21T06:18:31Z)
Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
Hyperbolic Molecular Representation Learning for Drug Repositioning [19.73556079390888]
A drug hierarchy is a valuable source that encodes knowledge of relations among drugs in a tree-like structure. Here, we develop a semi-supervised drug embedding that incorporates two sources of information. We show that the learned drug embedding can induce the hierarchical relations among drugs.
arXiv Detail & Related papers (2022-07-06T20:20:29Z)
Neural networks for Anatomical Therapeutic Chemical (ATC) [83.73971067918333]
We propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM). Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature.
arXiv Detail & Related papers (2021-01-22T19:49:47Z)
Semi-Supervised Hierarchical Drug Embedding in Hyperbolic Space [19.73556079390888]
A drug hierarchy is a valuable source that encodes human knowledge of drug relations in a tree-like structure. Here, we develop a semi-supervised drug embedding that incorporates two sources of information. We show that the learned drug embedding can be used to find new uses for existing drugs and to discover side-effects.
arXiv Detail & Related papers (2020-06-01T14:46:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.