Semi-Automating Knowledge Base Construction for Cancer Genetics
- URL: http://arxiv.org/abs/2005.08146v2
- Date: Tue, 26 May 2020 00:47:33 GMT
- Title: Semi-Automating Knowledge Base Construction for Cancer Genetics
- Authors: Somin Wadhwa, Kanhua Yin, Kevin S. Hughes, Byron C. Wallace
- Abstract summary: We propose models to automatically surface key elements from full-text cancer genetics articles.
We induce distant supervision over tokens and snippets in full-text articles using the manually constructed knowledge base.
- Score: 20.74608114488094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we consider the exponentially growing subarea of genetics in
cancer. The need to synthesize and centralize this evidence for dissemination
has motivated a team of physicians to manually construct and maintain a
knowledge base that distills key results reported in the literature. This is a
laborious process that entails reading through full-text articles to understand
the study design, assess study quality, and extract the reported cancer risk
estimates associated with particular hereditary cancer genes (i.e.,
penetrance). In this work, we propose models to automatically surface key
elements from full-text cancer genetics articles, with the ultimate aim of
expediting the manual workflow currently in place.
We propose two challenging tasks that are critical for characterizing the
findings reported in cancer genetics studies: (i) Extracting snippets of text that
describe ascertainment mechanisms, which in turn inform whether the
population studied may introduce bias owing to deviations from the target
population; (ii) Extracting reported risk estimates (e.g., odds or hazard
ratios) associated with specific germline mutations. The latter task may be
viewed as a joint entity tagging and relation extraction problem. To train
models for these tasks, we induce distant supervision over tokens and snippets
in full-text articles using the manually constructed knowledge base. We propose
and evaluate several model variants, including a transformer-based joint entity
and relation extraction model to extract <germline mutation, risk-estimate>
pairs. We observe strong empirical performance, highlighting the practical
potential for such models to aid KB construction in this space. We ablate
components of our model, observing, e.g., that a joint model for <germline
mutation, risk-estimate> fares substantially better than a pipelined approach.
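The distant supervision step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that knowledge base values are projected onto article tokens by exact, case-insensitive string matching and encoded as BIO tags; the abstract does not describe the actual matching heuristics, so the function `tag_tokens`, the field names `GENE` and `RISK`, and the example sentence are all hypothetical.

```python
from typing import Dict, List


def tag_tokens(tokens: List[str], kb_record: Dict[str, str]) -> List[str]:
    """Assign BIO labels to tokens that exactly match values in a KB record.

    Hypothetical helper: exact matching is an assumption made for
    illustration, not the matching procedure used in the paper.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for field, value in kb_record.items():
        target = value.lower().split()
        n = len(target)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(lab == "O" for lab in labels[i:i + n])
            if lowered[i:i + n] == target and window_free:
                # First matched token opens the span, the rest continue it.
                labels[i] = f"B-{field}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{field}"
    return labels


# Illustrative sentence and KB record (made up, not taken from the paper).
tokens = "BRCA1 carriers had an odds ratio of 2.5 for breast cancer".split()
record = {"GENE": "BRCA1", "RISK": "2.5"}
labels = tag_tokens(tokens, record)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Labels produced this way are noisy, since a KB value can also appear in the text in an unrelated context, which is why such supervision is called distant.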
Related papers
- Automatic Extraction of Disease Risk Factors from Medical Publications [1.321009936753118]
We present a novel approach to automating the identification of risk factors for diseases from medical literature.
We first identify relevant articles, then classify them based on the presence of risk factor discussions, and finally extract specific risk factor information for a disease.
Our contributions include the development of a comprehensive pipeline for the automated extraction of risk factors and the compilation of several datasets.
arXiv Detail & Related papers (2024-07-10T05:17:55Z)
- BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Rethinking Radiology Report Generation via Causal Inspired Counterfactual Augmentation [11.266364967223556]
Radiology Report Generation (RRG) draws attention as a vision-and-language interaction of biomedical fields.
Previous works inherited the ideology of traditional language generation tasks, aiming to generate paragraphs with high readability as reports.
Despite significant progress, the independence between diseases, a specific property of RRG, was neglected, leaving models confused by the co-occurrence of diseases brought on by the biased data distribution.
arXiv Detail & Related papers (2023-11-22T10:55:36Z)
- Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
- Applying BioBERT to Extract Germline Gene-Disease Associations for Building a Knowledge Graph from the Biomedical Literature [0.0]
This paper presents SimpleGermKG, an automatic knowledge graph construction approach that connects germline genes and diseases.
For the extraction of genes and diseases, we employ BioBERT, a pre-trained BERT model on biomedical corpora.
For semantic relationships between articles, genes, and diseases, we implemented a part-whole relation approach.
Our knowledge graph contains 297 genes, 130 diseases, and 46,747 triples.
arXiv Detail & Related papers (2023-09-11T18:05:12Z)
- Inducing Causal Structure for Abstractive Text Summarization [76.1000380429553]
We introduce a Structural Causal Model (SCM) to induce the underlying causal structure of the summarization data.
We propose a Causality Inspired Sequence-to-Sequence model (CI-Seq2Seq) to learn the causal representations that can mimic the causal factors.
Experimental results on two widely used text summarization datasets demonstrate the advantages of our approach.
arXiv Detail & Related papers (2023-08-24T16:06:36Z)
- Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge [6.244840529371179]
Understanding protein interactions and pathway knowledge is crucial for unraveling the complexities of living systems.
Existing databases provide curated biological data from literature and other sources, but their maintenance is labor-intensive.
We propose to harness the capabilities of large language models to address these issues by automatically extracting such knowledge from the relevant scientific literature.
arXiv Detail & Related papers (2023-07-17T20:01:11Z)
- EPICURE Ensemble Pretrained Models for Extracting Cancer Mutations from Literature [12.620782629498814]
EPICURE is an ensemble pre-trained model equipped with a conditional random field pattern layer and a span prediction pattern layer to extract cancer mutations from text.
Experimental results on three benchmark datasets show competitive results compared to the baseline models.
arXiv Detail & Related papers (2021-06-11T09:08:15Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)