Analyzing the Granularity and Cost of Annotation in Clinical Sequence Labeling
- URL: http://arxiv.org/abs/2108.09913v1
- Date: Mon, 23 Aug 2021 03:48:27 GMT
- Title: Analyzing the Granularity and Cost of Annotation in Clinical Sequence Labeling
- Authors: Haozhan Sun, Chenchen Xu, Hanna Suominen
- Abstract summary: Well-annotated datasets are becoming more important than ever for researchers in supervised machine learning (ML).
We analyze the relationship between the annotation granularity and ML performance in sequence labeling using clinical records from nursing shift-change handover.
We recommend that researchers and practitioners emphasize other features, such as textual knowledge, as a cost-effective way to increase sequence labeling performance.
- Score: 9.143551270841858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Well-annotated datasets, as recent top studies show, are becoming more
important than ever for researchers in supervised machine learning (ML).
However, the dataset annotation process and its related human labor costs
remain overlooked. In this work, we analyze the relationship between
annotation granularity and ML performance in sequence labeling, using clinical
records from nursing shift-change handover. We first study a model built from
textual language features alone, without additional information based on
nursing knowledge, and find that this sequence tagger performs well in most
categories at this granularity. We then include the additional manual
annotations made by a nurse and find that the sequence tagging performance
remains nearly the same. Finally, we offer the community a guideline and
reference, arguing that annotating at a detailed granularity is neither
necessary nor recommended because of its low return on investment. We
therefore recommend that researchers and practitioners emphasize other
features, such as textual knowledge, as a cost-effective way to increase
sequence labeling performance.
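To make the comparison concrete, here is a minimal sketch of the kind of experiment the abstract describes: a BIO-style sequence tagger trained on textual features alone, then retrained with an extra expert-annotation feature. It uses sklearn-crfsuite; the toy handover sentences, the MEDICATION label, and the nurse-style `expert.tag` feature are illustrative assumptions, not the paper's actual data or feature set.

```python
# Minimal sketch: a BIO sequence tagger on textual features alone, versus the
# same tagger with an extra (hypothetical) expert-annotation feature.
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Toy handover sentences with BIO labels for an illustrative MEDICATION span.
sentences = [
    ["Patient", "given", "paracetamol", "overnight"],
    ["Continue", "morphine", "as", "charted"],
]
labels = [
    ["O", "O", "B-MEDICATION", "O"],
    ["O", "B-MEDICATION", "O", "O"],
]
# Hypothetical per-token annotations from a domain expert (e.g., a nurse).
expert_tags = [
    [None, None, "drug", None],
    [None, "drug", None, None],
]

def token_features(sent, i, expert=None):
    """Textual features for token i; optionally add the expert tag."""
    feats = {
        "word.lower": sent[i].lower(),
        "word.istitle": sent[i].istitle(),
        "suffix3": sent[i][-3:],
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }
    if expert is not None and expert[i] is not None:
        feats["expert.tag"] = expert[i]
    return feats

def featurize(use_expert):
    return [
        [token_features(s, i, expert_tags[k] if use_expert else None)
         for i in range(len(s))]
        for k, s in enumerate(sentences)
    ]

for use_expert in (False, True):
    X = featurize(use_expert)
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, labels)
    pred = crf.predict(X)
    f1 = metrics.flat_f1_score(labels, pred, average="weighted")
    print(f"expert features={use_expert}: train F1={f1:.2f}")
```

Under the paper's finding, the second configuration would be expected to add little over the first, which is the basis for the low-return-on-investment argument.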
Related papers
- Query-Guided Self-Supervised Summarization of Nursing Notes [5.835276312834499]
We introduce QGSumm, a query-guided self-supervised domain adaptation framework for nursing note summarization.
Our approach generates high-quality, patient-centered summaries without relying on reference summaries for training.
arXiv Detail & Related papers (2024-07-04T18:54:30Z)
- Guidelines for Cerebrovascular Segmentation: Managing Imperfect Annotations in the context of Semi-Supervised Learning [3.231698506153459]
Supervised learning methods achieve excellent performance when fed a sufficient amount of labeled data.
Such labels are typically time-consuming, error-prone, and expensive to produce.
Semi-supervised learning approaches leverage both labeled and unlabeled data, and are very useful when only a small fraction of the dataset is labeled.
arXiv Detail & Related papers (2024-04-02T09:31:06Z)
- Prefer to Classify: Improving Text Classifiers via Auxiliary Preference Learning [76.43827771613127]
In this paper, we investigate task-specific preferences between pairs of input texts as a new form of auxiliary data annotation.
We propose a novel multi-task learning framework, called prefer-to-classify (P2C), which benefits from jointly learning the given classification task and the auxiliary preferences.
arXiv Detail & Related papers (2023-06-08T04:04:47Z)
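As a rough illustration of the P2C idea, the sketch below combines a classification loss with a Bradley-Terry-style pairwise preference loss over a shared encoder, in PyTorch. The architecture, the loss weighting, and the toy data are assumptions for illustration; the paper's actual framework differs in its details.

```python
# Sketch of a P2C-style multi-task objective (illustrative, not the paper's
# code): a shared encoder feeds a classification head and a scalar preference
# head; pairwise preferences are fit with a Bradley-Terry (logistic) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoderModel(nn.Module):
    def __init__(self, dim=32, num_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
        self.cls_head = nn.Linear(64, num_classes)   # main classification task
        self.pref_head = nn.Linear(64, 1)            # auxiliary preference score

    def forward(self, x):
        h = self.encoder(x)
        return self.cls_head(h), self.pref_head(h).squeeze(-1)

model = SharedEncoderModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: features, class labels, and preference pairs (i preferred over j).
x = torch.randn(8, 32)
y = torch.randint(0, 3, (8,))
pref_i, pref_j = torch.tensor([0, 2, 4]), torch.tensor([1, 3, 5])

opt.zero_grad()
logits, scores = model(x)
cls_loss = F.cross_entropy(logits, y)
# Bradley-Terry: P(i preferred over j) = sigmoid(score_i - score_j).
pref_loss = -F.logsigmoid(scores[pref_i] - scores[pref_j]).mean()
loss = cls_loss + 0.5 * pref_loss  # the weighting here is an arbitrary choice
loss.backward()
opt.step()
```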
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
However, they still struggle with accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
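A minimal sketch of such a self-verification loop, assuming a generic chat-completion interface: the first call extracts a value, the second asks the same model to quote supporting text and accept or reject the extraction. `call_llm`, the prompts, and the SUPPORTED/UNSUPPORTED convention are hypothetical placeholders, not the paper's protocol.

```python
# Sketch of a self-verification loop for LLM-based extraction (illustrative):
# one call extracts a candidate value, a second call asks the same model to
# cite supporting text ("provenance") and accept or reject the value.

def call_llm(prompt: str) -> str:
    # Placeholder: plug in any chat-completion client here.
    raise NotImplementedError("plug in your LLM client")

def extract_with_verification(note: str, field: str) -> dict:
    extraction = call_llm(
        f"From the clinical note below, extract the {field}.\n"
        f"Answer with the value only.\n\nNOTE:\n{note}"
    )
    verification = call_llm(
        f"You previously extracted {field} = '{extraction}' from this note:\n"
        f"{note}\n\n"
        "Quote the exact sentence that supports this value, then answer "
        "SUPPORTED or UNSUPPORTED on the final line."
    )
    supported = verification.strip().splitlines()[-1].strip() == "SUPPORTED"
    return {"field": field, "value": extraction,
            "provenance": verification, "accepted": supported}
```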
- Assisted Text Annotation Using Active Learning to Achieve High Quality with Little Effort [9.379650501033465]
We propose a tool that enables researchers to create large, high-quality, annotated datasets with only a few manual annotations.
We combine an active learning (AL) approach with a pre-trained language model to semi-automatically identify annotation categories.
Our preliminary results show that employing AL strongly reduces the number of annotations required for correct classification, even of complex and subtle frames.
arXiv Detail & Related papers (2021-12-15T13:14:58Z)
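The core loop of such a tool can be sketched with pool-based active learning and least-confidence sampling; here a logistic-regression classifier on synthetic data stands in for the paper's pre-trained language model, and the batch size and seed set are arbitrary choices.

```python
# Sketch of a pool-based active-learning loop with least-confidence sampling
# (a simple stand-in for the paper's pre-trained-language-model setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 20))                      # unlabeled pool (toy)
true_y = (X_pool[:, 0] + X_pool[:, 1] > 0).astype(int)   # oracle labels

labeled = list(rng.choice(len(X_pool), size=10, replace=False))  # seed set
for round_ in range(5):
    clf = LogisticRegression().fit(X_pool[labeled], true_y[labeled])
    confidence = clf.predict_proba(X_pool).max(axis=1)
    # Pick the most uncertain unlabeled examples and "ask the annotator".
    candidates = [i for i in np.argsort(confidence) if i not in labeled]
    labeled.extend(candidates[:10])
    acc = clf.score(X_pool, true_y)
    print(f"round {round_}: {len(labeled)} labels, pool accuracy {acc:.3f}")
```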
- Enriched Annotations for Tumor Attribute Classification from Pathology Reports with Limited Labeled Data [10.876391752581862]
Much of the data about patients is locked away in unstructured free text, limiting research and the delivery of effective personalized treatments.
We develop a novel enriched hierarchical annotation scheme and algorithm, Supervised Line Attention (SLA).
We apply SLA to predicting categorical tumor attributes from kidney and colon cancer pathology reports from the University of California San Francisco.
arXiv Detail & Related papers (2020-12-15T06:31:38Z)
- An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
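A hedged sketch of the hierarchical idea: split a long note into snippets, pick the most promising ones, encode each with BERT, and pool the embeddings into a note-level representation. The keyword-overlap scoring and mean pooling are stand-in assumptions; the paper's snippet identification and truncated BERT variant are not reproduced here.

```python
# Sketch of hierarchical snippet encoding in the spirit of SnipBERT
# (illustrative: the keyword scoring and mean pooling are assumptions).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_long_note(note: str, keywords: set[str], top_k: int = 3):
    # 1) Split the long note into candidate snippets (here: sentences).
    snippets = [s.strip() for s in note.split(".") if s.strip()]
    # 2) Score snippets by overlap with task keywords (stand-in heuristic).
    scores = [sum(w.lower() in keywords for w in s.split()) for s in snippets]
    top = [s for _, s in sorted(zip(scores, snippets), reverse=True)[:top_k]]
    # 3) Encode each selected snippet and pool the [CLS] embeddings.
    with torch.no_grad():
        cls_vecs = [
            encoder(**tokenizer(s, return_tensors="pt",
                                truncation=True)).last_hidden_state[:, 0]
            for s in top
        ]
    return torch.cat(cls_vecs).mean(dim=0)  # feed into a classification head
```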
- Detecting Hallucinated Content in Conditional Neural Sequence Generation [165.68948078624499]
We propose a task to predict whether each token in the output sequence is hallucinated (i.e., not contained in the input).
We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data.
arXiv Detail & Related papers (2020-11-05T00:18:53Z)
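The task's label definition can be illustrated with a naive lexical baseline that flags output tokens absent from the input; the paper instead learns this labeling with pretrained language models fine-tuned on synthetic data, which this sketch does not attempt.

```python
# Naive baseline for the token-level hallucination task: label an output
# token 1 ("hallucinated") if it never appears in the input, else 0.
def hallucination_labels(source: str, output: str) -> list[tuple[str, int]]:
    source_vocab = {tok.lower() for tok in source.split()}
    return [(tok, int(tok.lower() not in source_vocab))
            for tok in output.split()]

src = "the patient was given paracetamol at night"
out = "the patient received ibuprofen at night"
print(hallucination_labels(src, out))
# [('the', 0), ('patient', 0), ('received', 1), ('ibuprofen', 1),
#  ('at', 0), ('night', 0)]
```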
- Learning Image Labels On-the-fly for Training Robust Classification Models [13.669654965671604]
We show how noisy annotations (e.g., from different algorithm-based labelers) can be utilized together and mutually benefit the learning of classification tasks.
A meta-training-based label-sampling module is designed to attend to the labels that benefit model learning the most, via additional back-propagation passes.
arXiv Detail & Related papers (2020-09-22T05:38:44Z)
- Active Learning for Coreference Resolution using Discrete Annotation [76.36423696634584]
We improve upon pairwise annotation for active learning in coreference resolution.
We ask annotators to identify mention antecedents if a presented mention pair is deemed not coreferent.
In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour.
arXiv Detail & Related papers (2020-04-28T17:17:11Z)
- How Useful is Self-Supervised Pretraining for Visual Tasks? [133.1984299177874]
We evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks.
Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows.
arXiv Detail & Related papers (2020-03-31T16:03:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.