NUBES: A Corpus of Negation and Uncertainty in Spanish Clinical Texts
- URL: http://arxiv.org/abs/2004.01092v1
- Date: Thu, 2 Apr 2020 15:51:31 GMT
- Title: NUBES: A Corpus of Negation and Uncertainty in Spanish Clinical Texts
- Authors: Salvador Lima, Naiara Perez, Montse Cuadros, and German Rigau
- Abstract summary: This paper introduces the first version of the NUBes corpus (Negation and Uncertainty annotations in Biomedical texts in Spanish).
The corpus is part of ongoing research and currently consists of 29,682 sentences obtained from anonymised health records annotated with negation and uncertainty.
- Score: 5.424799109837065
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper introduces the first version of the NUBes corpus (Negation and
Uncertainty annotations in Biomedical texts in Spanish). The corpus is part of
an on-going research and currently consists of 29,682 sentences obtained from
anonymised health records annotated with negation and uncertainty. The article
includes an exhaustive comparison with similar corpora in Spanish, and presents
the main annotation and design decisions. Additionally, we perform preliminary
experiments using deep learning algorithms to validate the annotated dataset.
As far as we know, NUBes is the largest publicly available corpus for negation
in Spanish and the first that also incorporates the annotation of speculation
cues, scopes, and events.
Related papers
- FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis [0.0]
The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
arXiv Detail & Related papers (2024-11-07T10:39:10Z)
- Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains [60.5207173547769]
We evaluate zero-shot generated summaries across specialized domains, including biomedical articles and legal bills.
We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors.
We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles.
arXiv Detail & Related papers (2024-02-05T20:51:11Z)
- FRACAS: A FRench Annotated Corpus of Attribution relations in newS [0.0]
We present a manually annotated corpus of 1676 newswire texts in French for quotation extraction and source attribution.
We first describe the composition of our corpus and the choices that were made in selecting the data.
We then detail our inter-annotator agreement between the 8 annotators who worked on manual labelling.
arXiv Detail & Related papers (2023-09-19T13:19:54Z)
- Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation [59.307534363825816]
Negation is poorly captured by current language models, although the extent of this problem is not widely understood.
We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods.
arXiv Detail & Related papers (2022-10-06T23:39:01Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- Understanding Pre-trained BERT for Aspect-based Sentiment Analysis [71.40586258509394]
This paper analyzes the pre-trained hidden representations learned from reviews on BERT for tasks in aspect-based sentiment analysis (ABSA).
It is not clear how the general proxy task of a (masked) language model trained on an unlabeled corpus, without annotations of aspects or opinions, can provide important features for downstream tasks in ABSA.
arXiv Detail & Related papers (2020-10-31T02:21:43Z)
- Named Entities in Medical Case Reports: Corpus and Experiments [0.5773440045183915]
We present a new corpus comprising annotations of medical entities in case reports, originating from PubMed Central's open access library.
In the case reports, we annotate cases, conditions, findings, factors and negation modifiers.
This is the first corpus of this kind made available to the scientific community in English.
arXiv Detail & Related papers (2020-03-29T14:08:43Z)
- SemClinBr -- a multi institutional and multi specialty semantically annotated corpus for Portuguese clinical NLP tasks [0.7311642662742726]
SemClinBr is a corpus that has 1,000 clinical notes, labeled with 65,117 entities and 11,263 relations.
arXiv Detail & Related papers (2020-01-27T20:39:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.