Related papers: Enriched Annotations for Tumor Attribute Classification from Pathology Reports with Limited Labeled Data

URL: http://arxiv.org/abs/2012.08113v1
Date: Tue, 15 Dec 2020 06:31:38 GMT
Title: Enriched Annotations for Tumor Attribute Classification from Pathology Reports with Limited Labeled Data
Authors: Nick Altieri, Briton Park, Mara Olson, John DeNero, Anobel Odisho, Bin Yu
Abstract summary: Much of the data for patients is locked away in unstructured free-text, limiting research and delivery of effective personalized treatments. We develop a novel enriched hierarchical annotation scheme and algorithm, Supervised Line Attention (SLA). We apply SLA to predicting categorical tumor attributes from kidney and colon cancer pathology reports from the University of California San Francisco.
Score: 10.876391752581862
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Precision medicine has the potential to revolutionize healthcare, but much of the data for patients is locked away in unstructured free-text, limiting research and delivery of effective personalized treatments. Generating large annotated datasets for information extraction from clinical notes is often challenging and expensive due to the high level of expertise needed for high quality annotations. To enable natural language processing for small dataset sizes, we develop a novel enriched hierarchical annotation scheme and algorithm, Supervised Line Attention (SLA), and apply this algorithm to predicting categorical tumor attributes from kidney and colon cancer pathology reports from the University of California San Francisco (UCSF). Whereas previous work only annotated document level labels, we in addition ask the annotators to enrich the traditional label by asking them to also highlight the relevant line or potentially lines for the final label, which leads to a 20% increase of annotation time required per document. With the enriched annotations, we develop a simple and interpretable machine learning algorithm that first predicts the relevant lines in the document and then predicts the tumor attribute. Our results show across the small dataset sizes of 32, 64, 128, and 186 labeled documents per cancer, SLA only requires half the number of labeled documents as state-of-the-art methods to achieve similar or better micro-f1 and macro-f1 scores for the vast majority of comparisons that we made. Accounting for the increased annotation time, this leads to a 40% reduction in total annotation time over the state of the art.
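The 40% figure in the abstract follows from combining its two other numbers: each document takes about 20% longer to annotate, but only about half as many documents are needed. A minimal sketch of that arithmetic is below; the function name and parameter names are illustrative and do not come from the paper, and the calculation assumes "half the number of labeled documents" means a factor of exactly 0.5.

```python
def relative_annotation_time(per_doc_overhead: float, doc_fraction: float) -> float:
    """Total annotation time relative to a baseline that uses document-level labels only.

    per_doc_overhead: extra time per document for enriched annotation (0.20 = 20% more).
    doc_fraction: fraction of the baseline's documents that still need labeling.
    """
    return (1.0 + per_doc_overhead) * doc_fraction


# Figures taken from the abstract: 20% more time per document, half the documents.
ratio = relative_annotation_time(per_doc_overhead=0.20, doc_fraction=0.5)
reduction = 1.0 - ratio

print(f"relative total time: {ratio:.2f}, reduction: {reduction:.0%}")
```

Under these assumptions the relative total time is 1.2 x 0.5 = 0.6 of the baseline, which corresponds to the 40% reduction quoted above.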

Related papers

Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports [68.39938936308023]
We propose a novel text-guided learning method to achieve highly accurate cancer detection results. Our approach can leverage clinical knowledge from a large-scale pre-trained VLM to enhance generalization ability.
arXiv Detail & Related papers (2024-05-23T07:03:38Z)
Enhancing chest X-ray datasets with privacy-preserving large language models and multi-type annotations: a data-driven approach for improved classification [0.6144680854063935]
In chest X-ray (CXR) image analysis, rule-based systems are usually employed to extract labels from reports for dataset releases. We present MAPLEZ, a novel approach leveraging a locally executable Large Language Model (LLM) to extract and enhance findings labels.
arXiv Detail & Related papers (2024-03-06T20:10:41Z)
A Marker-based Neural Network System for Extracting Social Determinants of Health [12.6970199179668]
The impact of social determinants of health (SDoH) on patients' healthcare quality and disparities is well known. Many SDoH items are not coded in structured forms in electronic health records. We explore a multi-stage pipeline involving named entity recognition (NER), relation classification (RC), and text classification methods to extract SDoH information from clinical notes automatically.
arXiv Detail & Related papers (2022-12-24T18:40:23Z)
Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings. We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
Label Cleaning Multiple Instance Learning: Refining Coarse Annotations on Single Whole-Slide Images [83.7047542725469]
Annotating cancerous regions in whole-slide images (WSIs) of pathology samples plays a critical role in clinical diagnosis, biomedical research, and machine learning algorithms development. We present a method, named Label Cleaning Multiple Instance Learning (LC-MIL), to refine coarse annotations on a single WSI without the need of external training data. Our experiments on a heterogeneous WSI set with breast cancer lymph node metastasis, liver cancer, and colorectal cancer samples show that LC-MIL significantly refines the coarse annotations, outperforming the state-of-the-art alternatives, even while learning from a single slide.
arXiv Detail & Related papers (2021-09-22T15:06:06Z)
Analyzing the Granularity and Cost of Annotation in Clinical Sequence Labeling [9.143551270841858]
Well-annotated datasets are becoming more important for researchers than ever before in supervised machine learning (ML). We analyze the relationship between the annotation granularity and ML performance in sequence labeling using clinical records from nursing shift-change handover. We recommend emphasizing other features, like textual knowledge, for researchers and practitioners as a cost-effective source for increasing the sequence labeling performance.
arXiv Detail & Related papers (2021-08-23T03:48:27Z)
A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation [50.55448707570669]
We propose a novel token-level, reference-free hallucination detection task and an associated annotated dataset named HaDes. To create this dataset, we first perturb a large number of text segments extracted from English language Wikipedia, and then verify these with crowd-sourced annotations.
arXiv Detail & Related papers (2021-04-18T04:09:48Z)
Deep Semi-supervised Metric Learning with Dual Alignment for Cervical Cancer Cell Detection [49.78612417406883]
We propose a novel semi-supervised deep metric learning method for cervical cancer cell detection. Our model learns an embedding metric space and conducts dual alignment of semantic features on both the proposal and prototype levels. We construct a large-scale dataset for semi-supervised cervical cancer cell detection for the first time, consisting of 240,860 cervical cell images.
arXiv Detail & Related papers (2021-04-07T17:11:27Z)
An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research. Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains. In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
Renal Cell Carcinoma Detection and Subtyping with Minimal Point-Based Annotation in Whole-Slide Images [3.488702792183152]
It is much easier and cheaper to get unlabeled data from whole-slide images. Semi-supervised learning (SSL) is an effective way to utilize unlabeled data. We propose a framework that employs an SSL method to accurately detect cancerous regions.
arXiv Detail & Related papers (2020-08-12T14:12:07Z)
Exemplar Auditing for Multi-Label Biomedical Text Classification [0.4873362301533824]
We generalize a recently proposed zero-shot sequence labeling method, "supervised labeling via a convolutional decomposition". The approach yields classification with "introspection", relating the fine-grained features of an inference-time prediction to their nearest neighbors. Our proposed approach yields both a competitively effective classification model and an interrogation mechanism to aid healthcare workers in understanding the salient features that drive the model's predictions.
arXiv Detail & Related papers (2020-04-07T02:54:20Z)