BigBIO: A Framework for Data-Centric Biomedical Natural Language
Processing
- URL: http://arxiv.org/abs/2206.15076v1
- Date: Thu, 30 Jun 2022 07:15:45 GMT
- Title: BigBIO: A Framework for Data-Centric Biomedical Natural Language
Processing
- Authors: Jason Alan Fries, Leon Weber, Natasha Seelam, Gabriel Altay, Debajyoti
Datta, Samuele Garda, Myungsun Kang, Ruisi Su, Wojciech Kusa, Samuel
Cahyawijaya, Fabio Barth, Simon Ott, Matthias Samwald, Stephen Bach, Stella
Biderman, Mario Sänger, Bo Wang, Alison Callahan, Daniel León Periñán,
Théo Gigant, Patrick Haller, Jenny Chim, Jose David Posada,
John Michael Giorgi, Karthik Rangasai Sivaraman, Marc Pàmies, Marianna
Nezhurina, Robert Martin, Michael Cullan, Moritz Freidank, Nathan Dahlberg,
Shubhanshu Mishra, Shamik Bose, Nicholas Michio Broad, Yanis Labrak, Shlok S
Deshmukh, Sid Kiblawi, Ayush Singh, Minh Chien Vu, Trishala Neeraj, Jonas
Golde, Albert Villanova del Moral, Benjamin Beilharz
- Abstract summary: We introduce BigBIO, a community library of 126+ biomedical NLP datasets.
BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata.
We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases.
- Score: 13.30221348538759
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Training and evaluating language models increasingly requires the
construction of meta-datasets -- diverse collections of curated data with clear
provenance. Natural language prompting has recently led to improved zero-shot
generalization by transforming existing, supervised datasets into a diversity
of novel pretraining tasks, highlighting the benefits of meta-dataset curation.
While successful in general-domain text, translating these data-centric
approaches to biomedical language modeling remains challenging, as labeled
biomedical datasets are significantly underrepresented in popular data hubs. To
address this challenge, we introduce BigBIO, a community library of 126+
biomedical NLP datasets, currently covering 12 task categories and 10+
languages. BigBIO facilitates reproducible meta-dataset curation via
programmatic access to datasets and their metadata, and is compatible with
current platforms for prompt engineering and end-to-end few/zero-shot language
model evaluation. We discuss our process for task schema harmonization, data
auditing, contribution guidelines, and outline two illustrative use cases:
zero-shot evaluation of biomedical prompts and large-scale, multi-task
learning. BigBIO is an ongoing community effort and is available at
https://github.com/bigscience-workshop/biomedical
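The abstract notes that BigBIO exposes datasets and their metadata programmatically. Below is a minimal sketch of what such access can look like, assuming the datasets are distributed as Hugging Face `datasets` loaders (the mechanism used by the linked repository); the specific dataset id "bigbio/bc5cdr" and config name "bc5cdr_bigbio_kb" are illustrative assumptions and should be checked against the BigBIO dataset cards.

```python
from datasets import load_dataset  # Hugging Face `datasets` library

# Illustrative sketch: load one BigBIO dataset in its harmonized task schema.
# The dataset id and config name below are assumptions for illustration only;
# consult the repository's dataset cards for the exact identifiers.
dataset = load_dataset("bigbio/bc5cdr", name="bc5cdr_bigbio_kb")

# In the harmonized knowledge-base (KB) schema, each example exposes passages,
# entities, relations, and related annotations under common field names.
example = dataset["train"][0]
print(sorted(example.keys()))
for entity in example["entities"][:3]:
    print(entity["type"], entity["text"], entity["offsets"])
```

Because the harmonized schema uses the same field names across datasets, the same loop over entities or relations can, in principle, be reused across many corpora, which is what makes large-scale meta-dataset curation and zero-shot prompt construction tractable.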
Related papers
- BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - Exploring the Effectiveness of Instruction Tuning in Biomedical Language
Processing [19.41164870575055]
This study investigates the potential of instruction tuning for biomedical language processing.
We present a comprehensive, instruction-based model trained on a dataset that consists of approximately 200,000 instruction-focused samples.
arXiv Detail & Related papers (2023-12-31T20:02:10Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for
Biomedical Entity Recognition [4.865221751784403]
This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS.
Preliminary results from experiments in the extension of pre-trained LMs as well as training from scratch show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
arXiv Detail & Related papers (2023-07-20T18:08:34Z) - BioREx: Improving Biomedical Relation Extraction by Leveraging
Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z) - LLaVA-Med: Training a Large Language-and-Vision Assistant for
Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z) - SynthBio: A Case Study in Human-AI Collaborative Curation of Text
Datasets [26.75449546181059]
We introduce a novel method for efficient dataset curation.
We use a large language model to provide seed generations to human raters.
We show that our dataset of fictional biographies is less noisy than WikiBio.
arXiv Detail & Related papers (2021-11-11T21:21:48Z) - Slot Filling for Biomedical Information Extraction [0.5330240017302619]
We present a slot filling approach to the task of biomedical IE.
We follow the proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model.
arXiv Detail & Related papers (2021-09-17T14:16:00Z) - Scientific Language Models for Biomedical Knowledge Base Completion: An
Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance.
arXiv Detail & Related papers (2021-06-17T17:55:33Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)