The SourceData-NLP dataset: integrating curation into scientific
publishing for training large language models
- URL: http://arxiv.org/abs/2310.20440v1
- Date: Tue, 31 Oct 2023 13:22:38 GMT
- Authors: Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Thomas Lemberger
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Introduction: The scientific publishing landscape is expanding rapidly,
creating challenges for researchers to stay up-to-date with the evolution of
the literature. Natural Language Processing (NLP) has emerged as a potent
approach to automating knowledge extraction from this vast amount of
publications and preprints. Tasks such as Named-Entity Recognition (NER) and
Named-Entity Linking (NEL), in conjunction with context-dependent semantic
interpretation, offer promising and complementary approaches to extracting
structured information and revealing key concepts.
Results: We present the SourceData-NLP dataset produced through the routine
curation of papers during the publication process. A unique feature of this
dataset is its emphasis on the annotation of bioentities in figure legends. We
annotate eight classes of biomedical entities (small molecules, gene products,
subcellular components, cell lines, cell types, tissues, organisms, and
diseases), their role in the experimental design, and the nature of the
experimental method as an additional class. SourceData-NLP contains more than
620,000 annotated biomedical entities, curated from 18,689 figures in 3,223
papers in molecular and cell biology. We illustrate the dataset's usefulness by
assessing BioLinkBERT and PubMedBERT, two transformer-based models, fine-tuned
on the SourceData-NLP dataset for NER. We also introduce a novel
context-dependent semantic task that infers whether an entity is the target of
a controlled intervention or the object of measurement.
Conclusions: SourceData-NLP's scale highlights the value of integrating
curation into publishing. Models trained with SourceData-NLP will furthermore
enable the development of tools able to extract causal hypotheses from the
literature and assemble them into knowledge graphs.
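The annotation scheme described above (eight entity classes plus an experimental-method class, with entity roles from the intervention/measurement semantic task) can be sketched as a minimal data model. This is an illustrative sketch only: the class names, role labels, and example figure legend below are hypothetical and are not drawn from the actual SourceData-NLP schema or data.

```python
from dataclasses import dataclass
from enum import Enum

# The eight biomedical entity classes named in the abstract, plus the
# experimental-method class annotated as an additional category.
class EntityType(Enum):
    SMALL_MOLECULE = "small_molecule"
    GENE_PRODUCT = "gene_product"
    SUBCELLULAR_COMPONENT = "subcellular_component"
    CELL_LINE = "cell_line"
    CELL_TYPE = "cell_type"
    TISSUE = "tissue"
    ORGANISM = "organism"
    DISEASE = "disease"
    EXPERIMENTAL_METHOD = "experimental_method"

# Hypothetical labels for the context-dependent semantic task: is the
# entity the target of a controlled intervention, or the object of
# measurement? (Label names are assumptions, not the dataset's own.)
class Role(Enum):
    INTERVENTION = "intervention"
    MEASUREMENT = "measurement"
    NONE = "none"

@dataclass
class EntitySpan:
    text: str                # surface form as it appears in the legend
    start: int               # character offset into the legend text
    end: int                 # exclusive end offset
    entity_type: EntityType
    role: Role = Role.NONE

# A hypothetical figure legend with three annotated spans.
legend = "Western blot of p53 levels in HeLa cells after doxorubicin treatment."
spans = [
    EntitySpan("p53", 16, 19, EntityType.GENE_PRODUCT, Role.MEASUREMENT),
    EntitySpan("HeLa", 30, 34, EntityType.CELL_LINE),
    EntitySpan("doxorubicin", 47, 58, EntityType.SMALL_MOLECULE, Role.INTERVENTION),
]

# Sanity check: every span's offsets must recover its surface form.
for s in spans:
    assert legend[s.start:s.end] == s.text
```

Offset-based spans like these map directly onto the token-level BIO labels used to fine-tune NER models such as BioLinkBERT or PubMedBERT.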
Related papers
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature (2024-06-10)
  We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks. SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
- Extracting Protein-Protein Interactions (PPIs) from Biomedical Literature using Attention-based Relational Context Information (2024-03-08)
  This work presents unified, multi-source PPI corpora with vetted interaction definitions augmented by binary interaction type labels. A Transformer-based deep learning method exploits entities' relational context information for relation representation to improve relation classification performance. The model's performance is evaluated on four widely studied biomedical relation extraction datasets.
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey (2024-03-03)
  The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry, and biology. We provide an analysis of recent advancements achieved through cross-modeling of biomolecules and natural language.
- Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing (2023-12-31)
  This study investigates the potential of instruction tuning for biomedical language processing. We present a comprehensive, instruction-based model trained on a dataset of approximately 200,000 instruction-focused samples.
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs (2023-12-21)
  We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models. We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT. We show that our methodology leads to performance improvements in several instances while keeping computing requirements low.
- Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical Texts (2023-09-04)
  FlaMBé is a collection of expert-curated datasets that capture procedural knowledge in biomedical texts. The dataset is motivated by the observation that the methodology sections of academic papers are a ubiquitous source of procedural knowledge expressed as unstructured text.
- UMLS-KGI-BERT: Data-Centric Knowledge Integration in Transformers for Biomedical Entity Recognition (2023-07-20)
  This work contributes a data-centric paradigm for enriching the language representations of biomedical transformer-encoder LMs by extracting text sequences from the UMLS. Preliminary results from experiments on extending pre-trained LMs, as well as training from scratch, show that this framework improves downstream performance on multiple biomedical and clinical Named Entity Recognition (NER) tasks.
- Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review (2023-04-05)
  This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? (2023-03-08)
  We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining. We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data, which has resulted in significant improvements in the performance of downstream tasks.
- EBOCA: Evidences for BiOmedical Concepts Association Ontology (2022-08-01)
  This paper proposes EBOCA, an ontology that describes (i) biomedical domain concepts and associations between them, and (ii) evidences supporting these associations. Test data from a subset of DISNET, together with automatic association extractions from texts, have been transformed to create a Knowledge Graph that can be used in real scenarios.
- Data Mining in Clinical Trial Text: Transformers for Classification and Question Answering Tasks (2020-01-30)
  This research applies advances in natural language processing to evidence synthesis based on medical texts. The main focus is on information characterized via the Population, Intervention, Comparator, and Outcome (PICO) framework. Recent transformer-based neural network architectures show capacity for transfer learning and improved performance on downstream NLP tasks.
This list was automatically generated from the titles and abstracts of the papers listed on this site.