BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
- URL: http://arxiv.org/abs/2402.17810v2
- Date: Fri, 31 May 2024 14:07:00 GMT
- Title: BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning
- Authors: Qizhi Pei, Lijun Wu, Kaiyuan Gao, Xiaozhuan Liang, Yin Fang, Jinhua Zhu, Shufang Xie, Tao Qin, Rui Yan,
- Abstract summary: This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery.
BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data.
- Score: 77.90250740041411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and substantially improving the grounded reasoning over bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including \emph{3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets}, demonstrating remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at \url{https://github.com/QizhiPei/BioT5}.
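The abstract lists a "numerical tokenization technique" among the new features but does not describe it here. As a rough illustration only, the sketch below assumes a digit-level splitting scheme, in which each sign, digit, and decimal point of a number becomes its own token; this is one common way such a technique is implemented and is not taken from the BioT5+ code.

```python
# Minimal sketch (assumption, not from the BioT5+ paper or repository) of
# digit-level numerical tokenization: every number in the input text is split
# into individual characters so that each digit becomes its own token.
import re


def split_numbers(text: str) -> str:
    """Insert spaces between the characters of every number in `text`."""
    def _split(match: re.Match) -> str:
        # e.g. "3.14" -> "3 . 1 4", "-27" -> "- 2 7"
        return " ".join(match.group(0))

    return re.sub(r"-?\d+(?:\.\d+)?", _split, text)


if __name__ == "__main__":
    sample = "The measured logP is 3.14 and the assay value is -27."
    print(split_numbers(sample))
    # -> The measured logP is 3 . 1 4 and the assay value is - 2 7 .
```

The split text would then be passed to the model's regular tokenizer, so each digit maps to a stable token rather than being merged into arbitrary subwords.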
Related papers
- InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions [32.38318676313486]
InstructBioMol is designed to bridge natural language and biomolecules.
It can integrate multimodal biomolecules as input, and enable researchers to articulate design goals in natural language.
It can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an ESP Score of 70.4.
arXiv Detail & Related papers (2024-10-10T13:45:56Z)
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross-modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
- ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
- BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations [54.97423244799579]
$\mathbf{BioT5}$ is a pre-training framework that enriches cross-modal integration in biology with chemical knowledge and natural language associations.
$\mathbf{BioT5}$ distinguishes between structured and unstructured knowledge, leading to more effective utilization of information.
arXiv Detail & Related papers (2023-10-11T07:57:08Z)
- Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs [45.53337864477857]
Know2BIO is a general-purpose heterogeneous KG benchmark for the biomedical domain.
It integrates data from 30 diverse sources, capturing intricate relationships across 11 biomedical categories.
Know2BIO is capable of user-directed automated updating to reflect the latest knowledge in biomedical science.
arXiv Detail & Related papers (2023-10-05T00:34:56Z)
- BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER [52.79573512427998]
We present BioAug, a novel data augmentation framework for low-resource BioNER.
BioAug is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation.
We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets.
arXiv Detail & Related papers (2023-05-18T02:04:38Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- SciFive: a text-to-text transformer model for biomedical literature [0.9482369543628087]
We introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora.
Our results support the exploration of more difficult text generation tasks and the development of new methods in this area.
arXiv Detail & Related papers (2021-05-28T06:09:23Z)