A Distant Supervision Corpus for Extracting Biomedical Relationships
Between Chemicals, Diseases and Genes
- URL: http://arxiv.org/abs/2204.06584v1
- Date: Wed, 13 Apr 2022 18:02:05 GMT
- Title: A Distant Supervision Corpus for Extracting Biomedical Relationships
Between Chemicals, Diseases and Genes
- Authors: Dongxu Zhang, Sunil Mohan, Michaela Torkar, Andrew McCallum
- Abstract summary: ChemDisGene is a new dataset for training and evaluating multi-class multi-label document-level biomedical relation extraction models.
Our dataset contains 80k biomedical research abstracts labeled with mentions of chemicals, diseases, and genes.
- Score: 35.372588846754645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce ChemDisGene, a new dataset for training and evaluating
multi-class multi-label document-level biomedical relation extraction models.
Our dataset contains 80k biomedical research abstracts labeled with mentions of
chemicals, diseases, and genes, portions of which human experts labeled with 18
types of biomedical relationships between these entities (intended for
evaluation), and the remainder of which (intended for training) has been
distantly labeled via the CTD database with approximately 78% accuracy. In
comparison to similar preexisting datasets, ours is both substantially larger
and cleaner; it also includes annotations linking mentions to their entities.
We also provide three baseline deep neural network relation extraction models
trained and evaluated on our new dataset.
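The abstract says the training portion was "distantly labeled via the CTD database" but does not describe the procedure. As a rough illustration of distant supervision in general, the sketch below assigns a relation label to an entity pair in a document whenever a knowledge base lists that relation for the pair. The function name `distant_labels`, the entity identifiers, and the relation names are all hypothetical placeholders, not taken from ChemDisGene or CTD, and the authors' actual pipeline may differ.

```python
def distant_labels(doc_entities, kb_triples):
    """Assign relation labels to entity pairs found in one document.

    doc_entities: dict mapping entity id -> entity type for entities
                  mentioned in the document (hypothetical ids).
    kb_triples:   iterable of (head_id, relation, tail_id) facts from a
                  knowledge base (hypothetical relation names).
    Returns a dict mapping (head_id, tail_id) -> sorted list of relations,
    so one pair can carry several labels (multi-label).
    """
    labels = {}
    for head, relation, tail in kb_triples:
        # Keep a fact only if both of its entities appear in the document.
        if head in doc_entities and tail in doc_entities:
            labels.setdefault((head, tail), set()).add(relation)
    return {pair: sorted(rels) for pair, rels in labels.items()}


# Toy document with three linked entities (placeholder ids and types).
doc = {"C1": "Chemical", "D1": "Disease", "G1": "Gene"}

# Toy knowledge base (placeholder facts).
kb = [
    ("C1", "treats", "D1"),
    ("C1", "induces", "D1"),
    ("C1", "binds", "G2"),   # G2 is not in the document, so it is skipped
    ("C2", "treats", "D1"),  # C2 is not in the document, so it is skipped
]

print(distant_labels(doc, kb))
```

Labels produced this way are noisy, because a document can mention both entities without asserting the relation. That is consistent with the abstract reporting approximate rather than perfect accuracy for the distantly labeled portion.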
Related papers
- Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking [45.16091578348614]
State-of-the-art deep learning entity linking methods rely on extensive human-labelled data.
Current datasets are limited in size, leading to inadequate coverage of biomedical concepts.
We propose to automatically generate data to create large-scale training datasets.
arXiv Detail & Related papers (2024-07-08T18:04:22Z)
- BioBERT-based Deep Learning and Merged ChemProt-DrugProt for Enhanced Biomedical Relation Extraction [2.524192238862961]
Our approach integrates the ChemProt and DrugProt datasets using a novel merging strategy.
The study highlights the potential of automated information extraction in biomedical research and clinical practice.
arXiv Detail & Related papers (2024-05-28T21:34:01Z)
- Biomedical Entity Linking as Multiple Choice Question Answering [48.74212158495695]
We present BioELQA, a novel model that treats Biomedical Entity Linking as Multiple Choice Question Answering.
It first obtains candidate entities with a fast retriever, jointly presents the mention and candidate entities to a generator, and then outputs the predicted symbol associated with its chosen entity.
To improve generalization for long-tailed entities, we retrieve similar labeled training instances as clues and concatenate the input with the retrieved instances for the generator.
arXiv Detail & Related papers (2024-02-23T08:40:38Z)
- Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z)
- Towards Unifying Anatomy Segmentation: Automated Generation of a Full-body CT Dataset via Knowledge Aggregation and Anatomical Guidelines [113.08940153125616]
We generate a dataset of whole-body CT scans with 142 voxel-level labels for 533 volumes providing comprehensive anatomical coverage.
Our proposed procedure does not rely on manual annotation during the label aggregation stage.
We release our trained unified anatomical segmentation model capable of predicting 142 anatomical structures on CT data.
arXiv Detail & Related papers (2023-07-25T09:48:13Z)
- BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z)
- BioRED: A Comprehensive Biomedical Relation Extraction Dataset [6.915371362219944]
We present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types and relation pairs.
We label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
Our results show that while existing approaches can reach high performance on the NER task, there is much room for improvement for the RE task.
arXiv Detail & Related papers (2022-04-08T19:23:49Z)
- BioIE: Biomedical Information Extraction with Multi-head Attention Enhanced Graph Convolutional Network [9.227487525657901]
We propose Biomedical Information Extraction (BioIE), a hybrid neural network to extract relations from biomedical text and unstructured medical reports.
We evaluate our model on two major biomedical relationship extraction tasks, chemical-disease relation and chemical-protein interaction, and a cross-hospital pan-cancer pathology report corpus.
arXiv Detail & Related papers (2021-10-26T13:19:28Z)
- Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The interaction between drugs and targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from literature becomes an urgent demand in the industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z)
- Neural networks for Anatomical Therapeutic Chemical (ATC) [83.73971067918333]
We propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM).
Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature.
arXiv Detail & Related papers (2021-01-22T19:49:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.