Related papers: SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks

SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks

URL: http://arxiv.org/abs/2001.10071v1
Date: Mon, 27 Jan 2020 20:39:32 GMT
Title: SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks
Authors: Lucas Emanuel Silva e Oliveira, Ana Carolina Peters, Adalniza Moura Pucca da Silva, Caroline P. Gebeluca, Yohan Bonescki Gumiel, Lilian Mie Mukai Cintho, Deborah Ribeiro Carvalho, Sadid A. Hasan, Claudia Maria Cabral Moro
Abstract summary: SemClinBr is a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations. This work is the SemClinBr, a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations.
Score: 0.7311642662742726
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The high volume of research focusing on extracting patient's information from electronic health records (EHR) has led to an increase in the demand for annotated corpora, which are a very valuable resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multi-purpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. In this study, we developed a semantically annotated corpus using clinical texts from multiple medical specialties, document types, and institutions. We present the following: (1) a survey listing common aspects and lessons learned from previous research, (2) a fine-grained annotation schema which could be replicated and guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. The result of this work is the SemClinBr, a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations, and can support a variety of clinical NLP tasks and boost the EHR's secondary use for the Portuguese language.

Related papers

Natural Language Processing for Electronic Health Records in Scandinavian Languages: Norwegian, Swedish, and Danish [7.7320970512851614]
The study aims to perform a systematic review to assess and analyze the state-of-the-art NLP methods for the mainland Scandinavian clinical text. Out of the 113 articles, 18% focus on Norwegian clinical text, 64% (n=72) on Swedish, 10% (n=11) on Danish, and 8% (n=9) focus on more than one language.
arXiv Detail & Related papers (2025-03-24T10:47:32Z)
ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking. It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
arXiv Detail & Related papers (2024-04-09T15:04:27Z)
An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing [4.758617742396169]
We present a comprehensive and systematic experimental study on prompt engineering for five clinical NLP tasks. We assessed the prompts proposed in recent literature, including simple prefix, simple cloze, chain of thought, and anticipatory prompts. We provide novel insights and guidelines for prompt engineering for LLMs in clinical NLP.
arXiv Detail & Related papers (2023-09-14T19:35:00Z)
Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section [70.37720062263176]
We propose a framework to analyze the sections with high predictive power. Using MIMIC-III, we show that: 1) predictive power distribution is different between nursing notes and discharge notes and 2) combining different types of notes could improve performance when the context length is large.
arXiv Detail & Related papers (2023-07-13T20:04:05Z)
Natural Language Processing in Electronic Health Records in Relation to Healthcare Decision-making: A Systematic Review [2.555168694997103]
Natural Language Processing is widely used to extract clinical insights from Electronic Health Records. Lack of annotated data, automated tools, and other challenges hinder the full utilisation of NLP for EHRs. Various Machine Learning (ML), Deep Learning (DL) and NLP techniques are studied and compared to understand the limitations and opportunities in this space comprehensively.
arXiv Detail & Related papers (2023-06-22T12:10:41Z)
Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse [53.797797404164946]
The study highlights the difficulties faced in sharing tools and resources in this domain. We annotated a corpus of clinical documents according to 12 types of identifying entities. We build a hybrid system, merging the results of a deep learning model as well as manual rules.
arXiv Detail & Related papers (2023-03-23T17:17:46Z)
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings. We show the few-shot cross-lingual transfer property of LMs for named recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
A Unified Framework of Medical Information Annotation and Extraction for Chinese Clinical Text [1.4841452489515765]
Current state-of-the-art (SOTA) NLP models are highly integrated with deep learning techniques. This study presents an engineering framework of medical entity recognition, relation extraction and attribute extraction.
arXiv Detail & Related papers (2022-03-08T03:19:16Z)
Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching. We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders. We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark. It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification. We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches. We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.