BioT5: Enriching Cross-modal Integration in Biology with Chemical
Knowledge and Natural Language Associations
- URL: http://arxiv.org/abs/2310.07276v3
- Date: Mon, 29 Jan 2024 03:34:14 GMT
- Authors: Qizhi Pei, Wei Zhang, Jinhua Zhu, Kehan Wu, Kaiyuan Gao, Lijun Wu,
Yingce Xia, Rui Yan
- Abstract summary: $\mathbf{BioT5}$ is a pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations.
$\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective utilization of information.
- Score: 54.97423244799579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in biological research leverage the integration of
molecules, proteins, and natural language to enhance drug discovery. However,
current models exhibit several limitations, such as the generation of invalid
molecular SMILES, underutilization of contextual information, and equal
treatment of structured and unstructured knowledge. To address these issues, we
propose $\mathbf{BioT5}$, a comprehensive pre-training framework that enriches
cross-modal integration in biology with chemical knowledge and natural language
associations. $\mathbf{BioT5}$ utilizes SELFIES for $100\%$ robust molecular
representations and extracts knowledge from the surrounding context of
bio-entities in unstructured biological literature. Furthermore,
$\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge,
leading to more effective utilization of information. After fine-tuning, BioT5
shows superior performance across a wide range of tasks, demonstrating its
strong capability of capturing underlying relations and properties of
bio-entities. Our code is available at
$\href{https://github.com/QizhiPei/BioT5}{Github}$.
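The abstract's claim of "100% robust" molecular representations rests on a property of SELFIES: every token sequence decodes to some valid molecule, whereas an arbitrary SMILES string (e.g. one with an unclosed ring digit) can be syntactically invalid. The following is a highly simplified toy sketch of that principle only, not the real `selfies` library, which additionally enforces valence through a derivation-state machine:

```python
import random

# Tiny SELFIES-like alphabet: three atom tokens plus a ring-closure token.
ALPHABET = ["[C]", "[N]", "[O]", "[Ring1]"]

def decode(tokens):
    """Decode tokens into a SMILES-like string. A [Ring1] token only emits
    ring-bond digits when at least two atoms already exist; otherwise it is
    silently skipped. Because every token is either applied safely or
    ignored, decoding can never produce an invalid string."""
    atoms = []
    ring = None
    for t in tokens:
        if t == "[Ring1]":
            if len(atoms) >= 2 and ring is None:
                ring = (0, len(atoms) - 1)  # close a ring first-to-last atom
        else:
            atoms.append(t.strip("[]"))
    if ring is None:
        return "".join(atoms)
    out = atoms[:]
    out[ring[0]] += "1"   # paired ring-bond digits, e.g. C1CC1
    out[ring[1]] += "1"
    return "".join(out)

# Any random token sequence decodes without error:
random.seed(0)
seq = [random.choice(ALPHABET) for _ in range(6)]
mol = decode(seq)
```

In contrast, sampling raw SMILES characters would frequently yield strings no chemistry toolkit can parse; guaranteeing decodability is what lets a generative model's every output be a usable molecule.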
Related papers
- InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions [32.38318676313486]
InstructBioMol is designed to bridge natural language and biomolecules.
It can integrate multimodal biomolecules as input, and enable researchers to articulate design goals in natural language.
It can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an ESP Score of 70.4.
arXiv Detail & Related papers (2024-10-10T13:45:56Z)
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
- BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning [77.90250740041411]
This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery.
BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data.
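Numerical tokenization of the kind BioT5+ describes is commonly implemented as digit-level splitting, so a model sees numbers as compositional digit sequences rather than arbitrary subword chunks like "70" + ".4". A toy sketch of that idea (not BioT5+'s exact tokenizer) might look like:

```python
import re

def tokenize_number(text):
    """Split each number into single-digit tokens (plus '.' for decimals),
    leaving other words intact. The alternation order in the pattern makes
    decimal numbers match before bare integers or generic tokens."""
    tokens = []
    for piece in re.findall(r"\d+\.\d+|\d+|\S+", text):
        if re.fullmatch(r"\d+(\.\d+)?", piece):
            tokens.extend(list(piece))  # e.g. "70.4" -> ["7", "0", ".", "4"]
        else:
            tokens.append(piece)
    return tokens
```

For example, `tokenize_number("ESP 70.4")` yields `["ESP", "7", "0", ".", "4"]`, giving the model a uniform per-digit view of the value.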
arXiv Detail & Related papers (2024-02-27T12:43:09Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs [45.53337864477857]
Know2BIO is a general-purpose heterogeneous KG benchmark for the biomedical domain.
It integrates data from 30 diverse sources, capturing intricate relationships across 11 biomedical categories.
Know2BIO is capable of user-directed automated updating to reflect the latest knowledge in biomedical science.
arXiv Detail & Related papers (2023-10-05T00:34:56Z)
- Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
- Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing [107.49804059269212]
We present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions.
In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts.
arXiv Detail & Related papers (2022-12-21T06:18:31Z)
- SciFive: a text-to-text transformer model for biomedical literature [0.9482369543628087]
We introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora.
Our results support the exploration of more difficult text generation tasks and the development of new methods in this area.
arXiv Detail & Related papers (2021-05-28T06:09:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.