When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings
- URL: http://arxiv.org/abs/2403.12984v2
- Date: Wed, 27 Mar 2024 21:51:03 GMT
- Title: When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings
- Authors: Azmine Toushik Wasi, Šerbetar Karlo, Raima Islam, Taki Hasan Rafi, Dong-Kyu Chae
- Abstract summary: Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of molecules and bonds.
Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification?
The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types.
- Score: 5.648318448953635
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of molecules and bonds. These SMILES strings are used in different complex machine learning-based drug-related research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? Our experiments affirm the possibility with very competitive scores. The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types, proving that complex problems can also be solved with simpler perspectives. The data and code are available here: https://github.com/azminewasi/Drug-Classification-NLP.
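The idea of reading a SMILES string as a sentence can be illustrated with a small sketch. This is not the authors' implementation (their code is in the linked repository). It is a minimal example that assumes a simplified regex tokenizer and word-style n-gram counts as the text features. The names `SMILES_TOKEN`, `tokenize`, and `ngram_counts` are hypothetical, and the regex covers only common SMILES symbols.

```python
import re
from collections import Counter

# Simplified SMILES token pattern (an assumption, not a full SMILES grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond/branch/other symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]"
)


def tokenize(smiles):
    """Split a SMILES string into atom, bond, and branch tokens ("words")."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse input that the simplified pattern cannot cover completely.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def ngram_counts(tokens, n=2):
    """Count contiguous token n-grams, as a bag-of-n-grams text feature."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"  # SMILES for aspirin
    words = tokenize(aspirin)
    print(words)
    print(ngram_counts(words, n=2).most_common(3))
```

Feature vectors built this way could be passed to any standard text classifier. Which classifier and which features the paper actually uses is not stated in the abstract above.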
Related papers
- Coarse-to-Fine Highlighting: Reducing Knowledge Hallucination in Large Language Models [58.952782707682815]
COFT is a novel method to focus on different-level key texts, thereby avoiding getting lost in lengthy contexts.
Experiments on the knowledge hallucination benchmark demonstrate the effectiveness of COFT, leading to a superior performance over $30\%$ in the F1 score metric.
arXiv Detail & Related papers (2024-10-19T13:59:48Z)
- DrugCLIP: Contrastive Drug-Disease Interaction For Drug Repurposing [4.969453745531116]
DrugCLIP is a contrastive learning method to learn drug and disease's interaction without negative labels.
We have curated a drug repurposing dataset based on real-world clinical trial records.
arXiv Detail & Related papers (2024-07-02T13:41:59Z)
- Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem [58.3723958800254]
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks.
They are susceptible to producing unreliable conjectures in ambiguous contexts called hallucination.
This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on the unanswerable math word problem (MWP).
arXiv Detail & Related papers (2024-03-06T09:06:34Z)
- Emerging Opportunities of Using Large Language Models for Translation Between Drug Molecules and Indications [6.832024637226738]
We propose a new task, which is the translation between drug molecules and corresponding indications.
The creation of molecules from indications, or vice versa, will allow for more efficient targeting of diseases.
arXiv Detail & Related papers (2024-02-14T21:33:13Z)
- Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs [16.508471997999496]
We investigate whether and how language models understand the chemical spatial structure from 1D sequences.
The results indicate that language models can understand chemical structures from the perspective of molecular fragments.
arXiv Detail & Related papers (2024-01-15T12:53:58Z)
- Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
- Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing [107.49804059269212]
We present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions.
In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts.
arXiv Detail & Related papers (2022-12-21T06:18:31Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- Hyperbolic Molecular Representation Learning for Drug Repositioning [19.73556079390888]
A drug hierarchy is a valuable source that encodes knowledge of relations among drugs in a tree-like structure.
Here, we develop a semi-supervised drug embedding that incorporates two sources of information.
We show that the learned drug embedding can induce the hierarchical relations among drugs.
arXiv Detail & Related papers (2022-07-06T20:20:29Z)
- Neural networks for Anatomical Therapeutic Chemical (ATC) [83.73971067918333]
We propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM).
Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature.
arXiv Detail & Related papers (2021-01-22T19:49:47Z)
- Semi-Supervised Hierarchical Drug Embedding in Hyperbolic Space [19.73556079390888]
A drug hierarchy is a valuable source that encodes human knowledge of drug relations in a tree-like structure.
Here, we develop a semi-supervised drug embedding that incorporates two sources of information.
We show that the learned drug embedding can be used to find new uses for existing drugs and to discover side-effects.
arXiv Detail & Related papers (2020-06-01T14:46:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.