Related papers: Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques

URL: http://arxiv.org/abs/2505.09794v1
Date: Wed, 14 May 2025 20:44:29 GMT
Title: Automated Detection of Clinical Entities in Lung and Breast Cancer Reports Using NLP Techniques
Authors: J. Moreno-Casanova, J. M. Auñón, A. Mártinez-Pérez, M. E. Pérez-Martínez, M. E. Gas-López
Abstract summary: We focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. To enhance the accuracy and efficiency of data extraction, we utilize GMV's NLP tool uQuery.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Research projects, including those focused on cancer, rely on the manual extraction of information from clinical reports. This process is time-consuming and prone to errors, limiting the efficiency of data-driven approaches in healthcare. To address these challenges, Natural Language Processing (NLP) offers an alternative for automating the extraction of relevant data from electronic health records (EHRs). In this study, we focus on lung and breast cancer due to their high incidence and the significant impact they have on public health. Early detection and effective data management in both types of cancer are crucial for improving patient outcomes. To enhance the accuracy and efficiency of data extraction, we utilized GMV's NLP tool uQuery, which excels at identifying relevant entities in clinical texts and converting them into standardized formats such as SNOMED and OMOP. uQuery not only detects and classifies entities but also associates them with contextual information, including negated entities, temporal aspects, and patient-related details. In this work, we explore the use of NLP techniques, specifically Named Entity Recognition (NER), to automatically identify and extract key clinical information from EHRs related to these two cancers. A dataset from Health Research Institute Hospital La Fe (IIS La Fe), comprising 200 annotated breast cancer and 400 lung cancer reports, was used, with eight clinical entities manually labeled using the Doccano platform. To perform NER, we fine-tuned the bsc-bio-ehr-en3 model, a RoBERTa-based biomedical linguistic model pre-trained in Spanish. Fine-tuning was performed using the Transformers architecture, enabling accurate recognition of clinical entities in these cancer types. Our results demonstrate strong overall performance, particularly in identifying entities like MET and PAT, although challenges remain with less frequent entities like EVOL.
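As a rough illustration of the annotation-to-training step the abstract describes, the sketch below converts character-level entity spans into token-level BIO tags, the usual input format for fine-tuning a token-classification model. It is not the authors' pipeline: the function name spans_to_bio, the whitespace tokenizer, the (start, end, label) span layout, and the example sentence are all assumptions made for illustration. The label MET is taken from the abstract, but the abstract does not say what it denotes, so pairing it with this phrase is purely hypothetical.

```python
import re


def spans_to_bio(text, spans):
    """Convert character-level spans into token-level BIO tags.

    spans: list of (start, end, label) character offsets, end exclusive.
    Tokenization is plain whitespace splitting (an assumption for this
    sketch; a real pipeline would use the model's own tokenizer).
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span if the character ranges overlap.
            if tok_start < end and tok_end > start:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if tok_start <= start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical example sentence and annotation, not taken from the dataset.
text = "No evidence of liver metastasis today"
phrase = "liver metastasis"
start = text.index(phrase)
spans = [(start, start + len(phrase), "MET")]

tokens, tags = spans_to_bio(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In a fine-tuning setup, these word-level tags would still need to be aligned to the subword tokens produced by the model's tokenizer before training.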

Related papers

Natural Language Processing for Analyzing Electronic Health Records and Clinical Notes in Cancer Research: A Review [1.3966247773236926]
This review aims to analyze the application of natural language processing (NLP) techniques in cancer research using electronic health records (EHRs) and clinical notes. Data extraction included study characteristics, cancer types, NLP methodologies, dataset information, performance metrics, challenges, and future directions.
arXiv Detail & Related papers (2024-10-29T16:17:07Z)
Boosting Medical Image-based Cancer Detection via Text-guided Supervision from Reports [68.39938936308023]
We propose a novel text-guided learning method to achieve highly accurate cancer detection results. Our approach can leverage clinical knowledge by large-scale pre-trained VLM to enhance generalization ability.
arXiv Detail & Related papers (2024-05-23T07:03:38Z)
Classifying Cancer Stage with Open-Source Clinical Large Language Models [0.35998666903987897]
Open-source clinical large language models (LLMs) can extract pathologic tumor-node-metastasis (pTNM) staging information from real-world pathology reports. Our findings suggest that while LLMs still exhibit subpar performance in Tumor (T) classification, with the appropriate adoption of prompting strategies, they can achieve comparable performance on Metastasis (M) and improved performance on Node (N) classification.
arXiv Detail & Related papers (2024-04-02T02:30:47Z)
Personalised Drug Identifier for Cancer Treatment with Transformers using Auxiliary Information [5.2992434144875515]
Cancer remains a global challenge due to its growing clinical and economic burden. Genomic profiling is increasingly becoming part of clinical diagnostic panels. Effective use of such panels requires accurate drug response prediction models, which are challenging to build due to limited labelled patient data. We present the design of a treatment recommendation system (TRS), which is currently deployed at the National University Hospital, Singapore and is being evaluated in a clinical trial.
arXiv Detail & Related papers (2024-02-16T10:29:25Z)
A new algorithm for Subgroup Set Discovery based on Information Gain [58.720142291102135]
Information Gained Subgroup Discovery (IGSD) is a new SD algorithm for pattern discovery. We compare IGSD with two state-of-the-art SD algorithms: FSSD and SSD++. IGSD provides better OR values than FSSD and SSD++, stating a higher dependence between patterns and targets.
arXiv Detail & Related papers (2023-07-26T21:42:34Z)
Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining. We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data. Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
A Marker-based Neural Network System for Extracting Social Determinants of Health [12.6970199179668]
The impact of social determinants of health (SDoH) on patients' healthcare quality and disparities is well known. Many SDoH items are not coded in structured forms in electronic health records. We explore a multi-stage pipeline involving named entity recognition (NER), relation classification (RC), and text classification methods to extract SDoH information from clinical notes automatically.
arXiv Detail & Related papers (2022-12-24T18:40:23Z)
Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG). CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure. Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)
Intelligent Sight and Sound: A Chronic Cancer Pain Dataset [74.77784420691937]
This paper introduces the first chronic cancer pain dataset, collected as part of the Intelligent Sight and Sound (ISS) clinical trial. The data collected to date consists of 29 patients, 509 smartphone videos, 189,999 frames, and self-reported affective and activity pain scores. Using static images and multi-modal data to predict self-reported pain levels, early models show significant gaps between current methods available to predict pain.
arXiv Detail & Related papers (2022-04-07T22:14:37Z)
Lung Cancer Lesion Detection in Histopathology Images Using Graph-Based Sparse PCA Network [93.22587316229954]
We propose a graph-based sparse principal component analysis (GS-PCA) network for automated detection of cancerous lesions on histological lung slides stained by hematoxylin and eosin (H&E). We evaluate the performance of the proposed algorithm on H&E slides obtained from an SVM K-rasG12D lung cancer mouse model using precision/recall rates, F-score, Tanimoto coefficient, and area under the curve (AUC) of the receiver operator characteristic (ROC).
arXiv Detail & Related papers (2021-10-27T19:28:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.