EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and
Dictionary-based Named Entity Recognition from Medical Text
- URL: http://arxiv.org/abs/2304.07805v2
- Date: Thu, 7 Mar 2024 11:52:11 GMT
- Title: EasyNER: A Customizable Easy-to-Use Pipeline for Deep Learning- and
Dictionary-based Named Entity Recognition from Medical Text
- Authors: Rafsan Ahmed, Petter Berntsson, Alexander Skafte, Salma Kazemi Rashed,
Marcus Klang, Adam Barvesten, Ola Olde, William Lindholm, Antton Lamarca
Arrizabalaga, Pierre Nugues, Sonja Aits
- Abstract summary: We develop an easy-to-use end-to-end pipeline for deep learning- and dictionary-based named entity recognition.
The pipeline can access and process large medical research article collections (PubMed, CORD-19) or raw text.
The output consists of publication-ready ranked lists and graphs of detected entities and files containing the annotated texts.
- Score: 32.73124984242397
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Background: Medical research generates millions of publications and it is a
great challenge for researchers to utilize this information in full since its
scale and complexity greatly surpass human reading capabilities. Automated
text mining can help extract and connect information spread across this large
body of literature but this technology is not easily accessible to life
scientists. Results: Here, we developed an easy-to-use end-to-end pipeline for
deep learning- and dictionary-based named entity recognition (NER) of typical
entities found in medical research articles, including diseases, cells,
chemicals, genes/proteins, and species. The pipeline can access and process
large medical research article collections (PubMed, CORD-19) or raw text and
incorporates a series of deep learning models fine-tuned on the HUNER corpora
collection. In addition, the pipeline can perform dictionary-based NER related
to COVID-19 and other medical topics. Users can also load their own NER models
and dictionaries to include additional entities. The output consists of
publication-ready ranked lists and graphs of detected entities and files
containing the annotated texts. An associated script allows rapid inspection of
the results for specific entities of interest. As model use cases, the pipeline
was deployed on two collections of autophagy-related abstracts from PubMed and
on the CORD-19 dataset, a collection of 764 398 research article abstracts
related to COVID-19. Conclusions: The NER pipeline we present is applicable in a
variety of medical research settings and makes customizable text mining
accessible to life scientists.
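The dictionary-based tagging and ranked entity lists described in the abstract can be illustrated with a minimal Python sketch. This is not EasyNER's implementation: the toy dictionary `TERMS`, its entity labels, and the helper names `build_pattern`, `tag`, and `rank` are all hypothetical, and matching is a plain case-insensitive regular expression rather than whatever matcher the pipeline actually uses.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (surface term -> entity class); not from EasyNER.
TERMS = {
    "autophagy": "process",
    "SARS-CoV-2": "species",
    "COVID-19": "disease",
    "chloroquine": "chemical",
}


def build_pattern(terms):
    """Compile one case-insensitive regex that matches any dictionary term."""
    # Longest terms first, so a longer phrase wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # Lookarounds instead of \b, because terms may contain hyphens or digits.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE)


def tag(text, terms=TERMS):
    """Return (start, end, canonical_term, label) for each dictionary hit."""
    lookup = {term.lower(): (term, label) for term, label in terms.items()}
    hits = []
    for match in build_pattern(terms).finditer(text):
        canonical, label = lookup[match.group(0).lower()]
        hits.append((match.start(), match.end(), canonical, label))
    return hits


def rank(texts, terms=TERMS):
    """Count hits over a collection of texts and return them most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(canonical for _, _, canonical, _ in tag(text, terms))
    return counts.most_common()


if __name__ == "__main__":
    abstracts = [
        "Chloroquine inhibits autophagy.",
        "Autophagy in COVID-19 patients.",
        "SARS-CoV-2 infection and autophagy.",
    ]
    for term, count in rank(abstracts):
        print(term, count)
```

Character offsets are kept alongside each hit so that annotated text files could be produced from the same pass; a deep learning model would replace `tag` while `rank` stays unchanged.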
Related papers
- Facilitating phenotyping from clinical texts: the medkit library [1.7924255866089314]
Phenotyping consists in applying algorithms to identify individuals associated with a specific, potentially complex, trait or condition.
Because much of the clinical information in EHRs lies in text, phenotyping from text plays an important role in studies that rely on the secondary use of EHRs.
We developed an open-source Python library named medkit to facilitate the development, evaluation and reproducibility of phenotyping pipelines.
arXiv Detail & Related papers (2024-08-30T16:54:06Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- High-throughput Biomedical Relation Extraction for Semi-Structured Web Articles Empowered by Large Language Models [1.9665865095034865]
We formulate the relation extraction task as binary classifications for large language models.
We designate the main title as the tail entity and explicitly incorporate it into the context.
Longer contents are sliced into text chunks, embedded, and retrieved with additional embedding models.
arXiv Detail & Related papers (2023-12-13T16:43:41Z)
- CARE: Extracting Experimental Findings From Clinical Literature [29.763929941107616]
This work presents CARE, a new IE dataset for the task of extracting clinical findings.
We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes.
We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports.
arXiv Detail & Related papers (2023-11-16T10:06:19Z)
- Integrating curation into scientific publishing to train AI models [1.6982459897303823]
We have embedded multimodal data curation into the academic publishing process to annotate segmented figure panels and captions.
The dataset, SourceData-NLP, contains more than 620,000 annotated biomedical entities.
We evaluate the utility of the dataset to train AI models using named-entity recognition, segmentation of figure captions into their constituent panels, and a novel context-dependent semantic task.
arXiv Detail & Related papers (2023-10-31T13:22:38Z)
- Descriptive Knowledge Graph in Biomedical Domain [26.91431888505873]
We present a novel system that automatically extracts and generates informative and descriptive sentences from the biomedical corpus.
Unlike previous search engines or exploration systems that retrieve unconnected passages, our system organizes descriptive sentences as a graph.
We spotlight the application of our system in COVID-19 research, illustrating its utility in areas such as drug repurposing and literature curation.
arXiv Detail & Related papers (2023-10-18T03:10:25Z)
- DiscoverPath: A Knowledge Refinement and Retrieval System for Interdisciplinarity on Biomedical Research [96.10765714077208]
Traditional keyword-based search engines fall short in assisting users who may not be familiar with specific terminologies.
We present a knowledge graph-based paper search engine for biomedical research to enhance the user experience.
The system, dubbed DiscoverPath, employs Named Entity Recognition (NER) and part-of-speech (POS) tagging to extract terminologies and relationships from article abstracts to create a KG.
arXiv Detail & Related papers (2023-09-04T20:52:33Z)
- EBOCA: Evidences for BiOmedical Concepts Association Ontology [55.41644538483948]
This paper proposes EBOCA, an ontology that describes (i) biomedical domain concepts and associations between them, and (ii) evidences supporting these associations.
Test data from a subset of DISNET and automatic association extractions from texts have been transformed to create a Knowledge Graph that can be used in real scenarios.
arXiv Detail & Related papers (2022-08-01T18:47:03Z)
- Discovering Drug-Target Interaction Knowledge from Biomedical Literature [107.98712673387031]
The Interaction between Drugs and Targets (DTI) in the human body plays a crucial role in biomedical science and applications.
As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from literature becomes an urgent demand in the industry.
We explore the first end-to-end solution for this task by using generative approaches.
We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations.
arXiv Detail & Related papers (2021-09-27T17:00:14Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)