PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using
Transfer Learning
- URL: http://arxiv.org/abs/2102.13139v1
- Date: Thu, 25 Feb 2021 19:36:35 GMT
- Title: PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using
Transfer Learning
- Authors: Nasi Jofche, Kostadin Mishev, Riste Stojanov, Milos Jovanovik, Dimitar
Trajanov
- Abstract summary: PharmKE is a text analysis platform that applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles.
The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks.
The obtained results are compared to the fine-tuned BERT and BioBERT models trained on the same dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing named entities in a given text has been a highly dynamic field of research in recent years, owing to advances in neural network architectures, increases in computing power, and the availability of diverse labeled datasets, which together deliver pre-trained, highly accurate models. These tasks generally focus on tagging common entities, but domain-specific use cases require tagging custom entities that are not part of the pre-trained models. This can be solved either by fine-tuning the pre-trained models or by training custom models. The main challenge lies in obtaining reliable labeled training and test datasets, since manual labeling is a highly tedious task.
In this paper we present PharmKE, a text analysis platform focused on the
pharmaceutical domain, which applies deep learning through several stages for
thorough semantic analysis of pharmaceutical articles. It performs text
classification using state-of-the-art transfer learning models and integrates the results obtained through a proposed methodology. The methodology
is used to create accurately labeled training and test datasets, which are then
used to train models for custom entity labeling tasks, centered on the
pharmaceutical domain. The obtained results are compared to the fine-tuned BERT
and BioBERT models trained on the same dataset. Additionally, the PharmKE
platform integrates the results obtained from named entity recognition tasks to
resolve co-references of entities and analyze the semantic relations in every
sentence, thus setting up a baseline for additional text analysis tasks, such
as question answering and fact extraction. The recognized entities are also
used to expand the knowledge graph generated by DBpedia Spotlight for a given
pharmaceutical text.
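The abstract names two building blocks that are straightforward to prototype: fine-tuning a pre-trained BERT or BioBERT model for custom entity labeling, and annotating text with DBpedia Spotlight to seed a knowledge graph. The sketch below illustrates the first step with Hugging Face Transformers; the label set, dataset file names, and hyperparameters are illustrative assumptions, not details taken from the PharmKE paper.

```python
# Minimal sketch: fine-tuning a BERT-family model for custom pharmaceutical
# NER with Hugging Face Transformers. Label set, file names, and
# hyperparameters are hypothetical placeholders, not from the paper.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

labels = ["O", "B-DRUG", "I-DRUG", "B-DISEASE", "I-DISEASE"]  # assumed tag set
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

model_name = "dmis-lab/biobert-base-cased-v1.1"  # BioBERT; or "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels), id2label=id2label, label2id=label2id)

# Assume JSON files with word-level "tokens" and integer "ner_tags" columns
# (placeholder names standing in for the paper's labeled datasets).
raw = load_dataset("json", data_files={"train": "pharm_train.json",
                                       "test": "pharm_test.json"})

def tokenize_and_align(batch):
    # Align word-level tags with sub-word tokens; -100 masks special tokens
    # and continuation sub-words so the loss is computed once per word.
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=128)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids, prev, lab = enc.word_ids(batch_index=i), None, []
        for wid in word_ids:
            lab.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        all_labels.append(lab)
    enc["labels"] = all_labels
    return enc

tokenized = raw.map(tokenize_and_align, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pharmke-ner", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
```

For the knowledge-graph expansion step, a similarly hedged sketch queries DBpedia Spotlight's public REST endpoint, whose JSON response lists each matched surface form with its DBpedia URI; the confidence threshold and example sentence are assumptions.

```python
# Minimal sketch: annotating a pharmaceutical sentence with the public
# DBpedia Spotlight REST endpoint, as a starting point for expanding a
# knowledge graph with the recognized entities.
import requests

def spotlight_annotate(text, confidence=0.5,
                       endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    resp = requests.get(endpoint,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"},
                        timeout=30)
    resp.raise_for_status()
    # Each resource carries the matched surface form and its DBpedia URI.
    return [(r["@surfaceForm"], r["@URI"])
            for r in resp.json().get("Resources", [])]

print(spotlight_annotate("Aspirin is used to treat pain and fever."))
```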
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Exploring the Effectiveness of Instruction Tuning in Biomedical Language
Processing [19.41164870575055]
This study investigates the potential of instruction tuning for biomedical language processing.
We present a comprehensive, instruction-based model trained on a dataset that consists of approximately 200,000 instruction-focused samples.
arXiv Detail & Related papers (2023-12-31T20:02:10Z) - Enhancing Medical Specialty Assignment to Patients using NLP Techniques [0.0]
We propose an alternative approach that achieves superior performance while being computationally efficient.
Specifically, we utilize keywords to train a deep learning architecture that outperforms a language model pretrained on a large corpus of text.
Our results demonstrate that utilizing keywords for text classification significantly improves classification performance.
arXiv Detail & Related papers (2023-12-09T14:13:45Z) - Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z) - Interpretable Medical Diagnostics with Structured Data Extraction by
Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z) - Curriculum-Based Self-Training Makes Better Few-Shot Learners for
Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z) - Slot Filling for Biomedical Information Extraction [0.5330240017302619]
We present a slot filling approach to the task of biomedical IE.
We follow the proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reader model.
arXiv Detail & Related papers (2021-09-17T14:16:00Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z) - Data Mining in Clinical Trial Text: Transformers for Classification and
Question Answering Tasks [2.127049691404299]
This research applies advances in natural language processing to evidence synthesis based on medical texts.
The main focus is on information characterized via the Population, Intervention, Comparator, and Outcome (PICO) framework.
Recent neural network architectures based on transformers show capacities for transfer learning and increased performance on downstream natural language processing tasks.
arXiv Detail & Related papers (2020-01-30T11:45:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.