SemClinBr -- a multi institutional and multi specialty semantically
annotated corpus for Portuguese clinical NLP tasks
- URL: http://arxiv.org/abs/2001.10071v1
- Date: Mon, 27 Jan 2020 20:39:32 GMT
- Title: SemClinBr -- a multi institutional and multi specialty semantically
annotated corpus for Portuguese clinical NLP tasks
- Authors: Lucas Emanuel Silva e Oliveira, Ana Carolina Peters, Adalniza Moura
Pucca da Silva, Caroline P. Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai
Cintho, Deborah Ribeiro Carvalho, Sadid A. Hasan, Claudia Maria Cabral Moro
- Abstract summary: SemClinBr is a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations.
This work is the SemClinBr, a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations.
- Score: 0.7311642662742726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The high volume of research focusing on extracting patient's information from
electronic health records (EHR) has led to an increase in the demand for
annotated corpora, which are a very valuable resource for both the development
and evaluation of natural language processing (NLP) algorithms. The absence of
a multi-purpose clinical corpus outside the scope of the English language,
especially in Brazilian Portuguese, is glaring and severely impacts scientific
progress in the biomedical NLP field. In this study, we developed a
semantically annotated corpus using clinical texts from multiple medical
specialties, document types, and institutions. We present the following: (1) a
survey listing common aspects and lessons learned from previous research, (2) a
fine-grained annotation schema which could be replicated and guide other
annotation initiatives, (3) a web-based annotation tool focusing on an
annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation
of the annotations. The result of this work is the SemClinBr, a corpus that has
1,000 clinical notes, labeled with 65,117 entities and 11,263 relations, and
can support a variety of clinical NLP tasks and boost the EHR's secondary use
for the Portuguese language.
Related papers
- ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
arXiv Detail & Related papers (2024-04-09T15:04:27Z) - An Empirical Evaluation of Prompting Strategies for Large Language
Models in Zero-Shot Clinical Natural Language Processing [4.758617742396169]
We present a comprehensive and systematic experimental study on prompt engineering for five clinical NLP tasks.
We assessed the prompts proposed in recent literature, including simple prefix, simple cloze, chain of thought, and anticipatory prompts.
We provide novel insights and guidelines for prompt engineering for LLMs in clinical NLP.
arXiv Detail & Related papers (2023-09-14T19:35:00Z) - Making the Most Out of the Limited Context Length: Predictive Power
Varies with Clinical Note Type and Note Section [70.37720062263176]
We propose a framework to analyze the sections with high predictive power.
Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large.
arXiv Detail & Related papers (2023-07-13T20:04:05Z) - Natural Language Processing in Electronic Health Records in Relation to
Healthcare Decision-making: A Systematic Review [2.555168694997103]
Natural Language Processing is widely used to extract clinical insights from Electronic Health Records.
Lack of annotated data, automated tools, and other challenges hinder the full utilisation of NLP for EHRs.
Various Machine Learning (ML), Deep Learning (DL) and NLP techniques are studied and compared to understand the limitations and opportunities in this space comprehensively.
arXiv Detail & Related papers (2023-06-22T12:10:41Z) - Development and validation of a natural language processing algorithm to
pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain.
We annotated a corpus of clinical documents according to 12 types of identifying entities.
We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z) - Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z) - A Unified Framework of Medical Information Annotation and Extraction for
Chinese Clinical Text [1.4841452489515765]
Current state-of-the-art (SOTA) NLP models are highly integrated with deep learning techniques.
This study presents an engineering framework of medical entity recognition, relation extraction and attribute extraction.
arXiv Detail & Related papers (2022-03-08T03:19:16Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.