DiMB-RE: Mining the Scientific Literature for Diet-Microbiome Associations
- URL: http://arxiv.org/abs/2409.19581v2
- Date: Sat, 29 Mar 2025 20:48:34 GMT
- Title: DiMB-RE: Mining the Scientific Literature for Diet-Microbiome Associations
- Authors: Gibong Hong, Veronica Hindle, Nadine M. Veasley, Hannah D. Holscher, Halil Kilicoglu
- Abstract summary: We constructed DiMB-RE, a comprehensive corpus annotated with diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction.
- Score: 0.10485739694839666
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Objective: To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies. Materials and Methods: We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (e.g., Nutrient, Microorganism) and 13 relation types (e.g., INCREASES, IMPROVES) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked two generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings. Results: DiMB-RE consists of 14,450 entities and 4,206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models. Discussion: To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. NLP models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors. Conclusions: DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
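The zero-shot benchmarking described in the abstract can be pictured with a minimal sketch like the one below, assuming the OpenAI Python client (openai>=1.0); the prompt wording, label subsets, and example passage are illustrative guesses, not the authors' exact protocol.

```python
# A minimal sketch of the zero-shot setting described above. The prompt
# wording, label subsets, and example passage are illustrative assumptions,
# not the authors' exact protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example labels named in the abstract; the full DiMB-RE schema defines
# 15 entity types and 13 relation types.
ENTITY_TYPES = ["Nutrient", "Microorganism"]
RELATION_TYPES = ["INCREASES", "IMPROVES"]

PROMPT = """Extract diet-microbiome associations from the passage.
Return one (head entity, relation, tail entity) triple per line.
Entity types include: {entities}. Relation types include: {relations}.

Passage:
{passage}"""

def extract_relations(passage: str, model: str = "gpt-4o-mini") -> str:
    """Zero-shot end-to-end relation extraction: no labeled examples in the prompt."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output helps evaluation
        messages=[{
            "role": "user",
            "content": PROMPT.format(
                entities=", ".join(ENTITY_TYPES),
                relations=", ".join(RELATION_TYPES),
                passage=passage,
            ),
        }],
    )
    return response.choices[0].message.content

print(extract_relations(
    "Inulin supplementation increased the relative abundance of Bifidobacterium."
))
```

Prepending a single annotated passage with its gold triples to the prompt would give the one-shot variant the paper also benchmarks.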
Related papers
- Extracting Patient History from Clinical Text: A Comparative Study of Clinical Large Language Models [3.1277841304339065]
This study evaluates the performance of clinical large language models (cLLMs) in recognizing medical history entities (MHEs).
We annotated 1,449 MHEs across 61 outpatient-related clinical notes from the MTSamples repository.
The cLLMs showed potential in reducing the time required for extracting MHEs by over 20%.
arXiv Detail & Related papers (2025-03-30T02:00:56Z)
- Semi-Supervised Learning from Small Annotated Data and Large Unlabeled Data for Fine-grained PICO Entity Recognition [17.791233666137092]
Existing approaches do not distinguish the attributes of PICO entities.
This study aims to develop a named entity recognition model to extract fine-grained PICO entities.
arXiv Detail & Related papers (2024-12-26T20:24:35Z)
- Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources [13.750202656564907]
Adverse event (AE) extraction is crucial for monitoring and analyzing the safety profiles of immunizations.
This study aims to evaluate the effectiveness of large language models (LLMs) and traditional deep learning models in AE extraction.
arXiv Detail & Related papers (2024-06-26T03:56:21Z)
- RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces RaTEScore, a novel entity-aware metric for Radiological Report (Text) Evaluation.
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
- BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles.
When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results competitive with larger models.
BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
arXiv Detail & Related papers (2024-03-27T10:18:21Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing-power requirements low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction (see the sketch after this list).
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
arXiv Detail & Related papers (2023-12-15T14:04:23Z)
- BioBLP: A Modular Framework for Learning on Multimodal Biomedical Knowledge Graphs [3.780924717521521]
We propose a modular framework for learning embeddings in knowledge graphs.
It allows encoding attribute data of different modalities while also supporting entities with missing attributes.
We train models using a biomedical KG containing approximately 2 million triples.
arXiv Detail & Related papers (2023-06-06T11:49:38Z)
- Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment, offering better effectiveness with less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
- A Distant Supervision Corpus for Extracting Biomedical Relationships Between Chemicals, Diseases and Genes [35.372588846754645]
ChemDisGene is a new dataset for training and evaluating multi-class multi-label document-level biomedical relation extraction models.
Our dataset contains 80k biomedical research abstracts labeled with mentions of chemicals, diseases, and genes.
arXiv Detail & Related papers (2022-04-13T18:02:05Z)
- BioRED: A Comprehensive Biomedical Relation Extraction Dataset [6.915371362219944]
We present BioRED, a first-of-its-kind biomedical RE corpus with multiple entity types and relation pairs.
We label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information.
Our results show that while existing approaches can reach high performance on the NER task, there is much room for improvement for the RE task.
arXiv Detail & Related papers (2022-04-08T19:23:49Z)
- Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications.
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
- R-BERT-CNN: Drug-target interactions extraction from biomedical literature [1.8814209805277506]
We present our participation in the DrugProt task of the BioCreative VII challenge.
Drug-target interactions (DTIs) are critical for drug discovery and repurposing.
There are more than 32 million biomedical articles on PubMed, and manually extracting DTIs from such a huge knowledge base is challenging.
arXiv Detail & Related papers (2021-10-31T22:50:33Z)
- FoodChem: A food-chemical relation extraction model [0.0]
We present a new Relation Extraction (RE) model for identifying chemicals present in the composition of food entities.
The BioBERT model achieves the best results, with a macro averaged F1 score of 0.902 in the unbalanced augmentation setting.
arXiv Detail & Related papers (2021-10-05T13:07:33Z)
- Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The interaction between drugs and targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from literature becomes an urgent demand in the industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z)
- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance.
arXiv Detail & Related papers (2021-06-17T17:55:33Z)
- Neural networks for Anatomical Therapeutic Chemical (ATC) [83.73971067918333]
We propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM).
Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature.
arXiv Detail & Related papers (2021-01-22T19:49:47Z)
- Towards Incorporating Entity-specific Knowledge Graph Information in Predicting Drug-Drug Interactions [1.14219428942199]
We propose a new method, BERTKG-DDI, which combines drug embeddings obtained from drugs' interactions with other biomedical entities with a domain-specific BioBERT embedding-based relation classification (RC) architecture.
Experiments conducted on the DDIExtraction 2013 corpus clearly indicate that this strategy improves on other baseline architectures by 4.1% macro F1-score.
arXiv Detail & Related papers (2020-12-21T06:44:32Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
- Assessing Graph-based Deep Learning Models for Predicting Flash Point [52.931492216239995]
Graph-based deep learning (GBDL) models were applied to flash point prediction for the first time.
The average R2 and mean absolute error (MAE) scores of MPNN are, respectively, 2.3% lower and 2.0 K higher than those of previous comparable studies.
arXiv Detail & Related papers (2020-02-26T06:10:12Z)
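As referenced in the $k$NN-BioEL entry above, the retrieval-enhanced idea (looking up similar training instances as clues for prediction) can be sketched minimally as follows, assuming a sentence-embedding encoder; the encoder name, mentions, and entity IDs are placeholders, not the authors' implementation.

```python
# A minimal sketch of the retrieval idea behind kNN-BioEL: embed training
# mentions once, then retrieve the k nearest training instances for a new
# mention as clues. Encoder, mentions, and entity IDs are placeholder
# assumptions, not the authors' setup.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

# Hypothetical training corpus: (mention text, linked entity ID).
train = [
    ("heart attack", "MESH:D009203"),
    ("myocardial infarction", "MESH:D009203"),
    ("high blood pressure", "MESH:D006973"),
]
train_vecs = encoder.encode([m for m, _ in train], normalize_embeddings=True)

def knn_clues(mention: str, k: int = 2):
    """Return the k most similar (mention, entity) training pairs with scores."""
    q = encoder.encode([mention], normalize_embeddings=True)[0]
    sims = train_vecs @ q            # cosine similarity: vectors are unit-norm
    top = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return [(train[i], float(sims[i])) for i in top]

print(knn_clues("hypertension"))  # nearest clue should be "high blood pressure"
```

In the full method, the retrieved (mention, entity) pairs would be fed to the linking model as additional evidence; the sketch shows only the retrieval step.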
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.