Into the Single Cell Multiverse: an End-to-End Dataset for Procedural
Knowledge Extraction in Biomedical Texts
- URL: http://arxiv.org/abs/2309.01812v1
- Date: Mon, 4 Sep 2023 21:02:36 GMT
- Title: Into the Single Cell Multiverse: an End-to-End Dataset for Procedural
Knowledge Extraction in Biomedical Texts
- Authors: Ruth Dannenfelser, Jeffrey Zhong, Ran Zhang and Vicky Yao
- Abstract summary: FlaMBé is a collection of expert-curated datasets that capture procedural knowledge in biomedical texts.
The dataset is inspired by the observation that the methodology sections of academic papers are a ubiquitous source of procedural knowledge described as unstructured text.
- Score: 2.2578044590557553
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Many of the most commonly explored natural language processing (NLP)
information extraction tasks can be thought of as evaluations of declarative
knowledge, or fact-based information extraction. Procedural knowledge
extraction, i.e., breaking down a described process into a series of steps, has
received much less attention, perhaps in part due to the lack of structured
datasets that capture the knowledge extraction process from end-to-end. To
address this unmet need, we present FlaMBé (Flow annotations for Multiverse
Biological entities), a collection of expert-curated datasets across a series
of complementary tasks that capture procedural knowledge in biomedical texts.
This dataset is inspired by the observation that one ubiquitous source of
procedural knowledge that is described as unstructured text is within academic
papers describing their methodology. The workflows annotated in FlaMBé are
from texts in the burgeoning field of single cell research, a research area
that has become notorious for the number of software tools and complexity of
workflows used. Additionally, FlaMBé provides, to our knowledge, the largest
manually curated named entity recognition (NER) and disambiguation (NED)
datasets for tissue/cell type, a fundamental biological entity that is critical
for knowledge extraction in the biomedical research domain. Beyond providing a
valuable dataset to enable further development of NLP models for procedural
knowledge extraction, automating the process of workflow mining also has
important implications for advancing reproducibility in biomedical research.
Related papers
- BioMNER: A Dataset for Biomedical Method Entity Recognition [25.403593761614424]
We propose a novel dataset for biomedical method entity recognition.
We employ an automated BioMethod entity recognition and information retrieval system to assist human annotation.
Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns.
arXiv Detail & Related papers (2024-06-28T16:34:24Z)
- An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- EMBRE: Entity-aware Masking for Biomedical Relation Extraction [12.821610050561256]
We introduce the Entity-aware Masking for Biomedical Relation Extraction (EMBRE) method for relation extraction.
Specifically, we integrate entity knowledge into a deep neural network by pretraining the backbone model with an entity masking objective.
arXiv Detail & Related papers (2024-01-15T18:12:01Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- High-throughput Biomedical Relation Extraction for Semi-Structured Web Articles Empowered by Large Language Models [1.9665865095034865]
We formulate the relation extraction task as binary classifications for large language models.
We designate the main title as the tail entity and explicitly incorporate it into the context.
Longer contents are sliced into text chunks, embedded, and retrieved with additional embedding models.
arXiv Detail & Related papers (2023-12-13T16:43:41Z)
- AIONER: All-in-one scheme-based biomedical named entity recognition using deep learning [7.427654811697884]
We present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema.
AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning.
arXiv Detail & Related papers (2022-11-30T12:35:00Z)
- Machine learning in bioprocess development: From promise to practice [58.720142291102135]
Data-driven methods like machine learning (ML) approaches have a high potential to rationally explore large design spaces.
The aim of this review is to demonstrate how ML methods have been applied so far in bioprocess development.
arXiv Detail & Related papers (2022-10-04T13:48:59Z)
- Federated Cycling (FedCy): Semi-supervised Federated Learning of Surgical Phases [57.90226879210227]
FedCy is a federated semi-supervised learning (FSSL) method that combines federated learning (FL) and self-supervised learning to exploit a decentralized dataset of both labeled and unlabeled videos.
We demonstrate significant performance gains over state-of-the-art FSSL methods on the task of automatic recognition of surgical phases.
arXiv Detail & Related papers (2022-03-14T17:44:53Z)
- Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The Interaction between Drugs and Targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers are published every year in the biomedical domain, automatically discovering DTI knowledge from the literature has become an urgent need in industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z)
- Slot Filling for Biomedical Information Extraction [0.5330240017302619]
We present a slot filling approach to the task of biomedical IE.
We follow the proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model.
arXiv Detail & Related papers (2021-09-17T14:16:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.