PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for
Pathology Domain
- URL: http://arxiv.org/abs/2205.06885v1
- Date: Fri, 13 May 2022 20:42:07 GMT
- Title: PathologyBERT -- Pre-trained Vs. A New Transformer Language Model for
Pathology Domain
- Authors: Thiago Santos, Amara Tariq, Susmita Das, Kavyasree Vayalpati, Geoffrey
H. Smith, Hari Trivedi, Imon Banerjee
- Abstract summary: Successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research.
No pathology-specific language model exists to support rapid data-mining development in the pathology space.
PathologyBERT is a pre-trained masked language model trained on 347,173 histopathology specimen reports.
- Score: 2.3628956573813498
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Pathology text mining is a challenging task given the reporting
variability and the constant emergence of new findings in cancer sub-type
definitions. However, successful text mining of a large pathology database can
play a critical role in advancing 'big data' cancer research such as
similarity-based treatment selection, case identification, prognostication,
surveillance, clinical trial screening, risk stratification, and many others.
While there is growing interest in developing language models for more specific
clinical domains, no pathology-specific language model exists to support rapid
data-mining development in the pathology space. In the literature, a few
approaches have fine-tuned general transformer models on specialized corpora
while maintaining the original tokenizer, but in fields requiring specialized
terminology these models often fail to perform adequately. We propose
PathologyBERT, a pre-trained masked language model trained on 347,173
histopathology specimen reports and publicly released in the Huggingface
repository. Our comprehensive experiments demonstrate that pre-training a
transformer model on pathology corpora yields performance improvements on
Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification
compared to general-purpose language models.
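Since the paper releases PathologyBERT as a masked language model on the Huggingface hub, the most direct way to probe it is a fill-mask query over a pathology-style sentence. The sketch below uses the transformers pipeline API; the hub identifier, the example sentence, and the BERT-style [MASK] token are assumptions made for illustration and should be checked against the actual release.

```python
# Minimal sketch: probing a pathology-domain masked language model with the
# Hugging Face transformers library. The model identifier below is an
# assumption -- the abstract only says the model was released on the hub,
# so verify the exact name before running.
from transformers import pipeline

MODEL_ID = "tsantos/PathologyBERT"  # assumed identifier, not confirmed by the paper

# "fill-mask" recovers likely tokens for a masked position, the same
# objective the model was pre-trained on over 347,173 specimen reports.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Hypothetical report fragment; assumes a BERT-style "[MASK]" token.
report_snippet = "Sections show invasive ductal [MASK] of the breast, grade 2."

for candidate in fill_mask(report_snippet, top_k=5):
    print(f"{candidate['token_str']:>15}  score={candidate['score']:.3f}")
```

A domain-specific tokenizer also matters here: a general-purpose tokenizer tends to fragment terms such as "histopathology" into several sub-word pieces, which is the failure mode the abstract attributes to fine-tuning while keeping the original tokenizer.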
Related papers
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, the Gemini-Vision series (Gemini) and GPT-4 series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated the performance of Gemini, GPT-4, and 4 popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training can match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z) - In-context learning enables multimodal large language models to classify
cancer pathology images [0.7085801706650957]
In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates.
Here, we systematically evaluate the model Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning.
Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while only requiring a minimal number of samples.
arXiv Detail & Related papers (2024-03-12T08:34:34Z) - OncoGPT: A Medical Conversational Model Tailored with Oncology Domain
Expertise on a Large Language Model Meta-AI (LLaMA) [6.486978719354015]
There is limited research on Large Language Models (LLMs) specifically addressing oncology-related queries.
We performed an extensive data collection of online question-answer interactions centered around oncology.
We observed a substantial enhancement in the model's understanding of genuine patient inquiries.
arXiv Detail & Related papers (2024-02-26T18:33:13Z) - Neural Machine Translation of Clinical Text: An Empirical Investigation
into Multilingual Pre-Trained Language Models and Transfer-Learning [6.822926897514793]
Experimental results are reported on three subtasks: 1) clinical case (CC), 2) clinical terminology (CT), and 3) ontological concept (OC).
Our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical domain data.
The transfer learning method works well in our experimental setting, using the WMT21fb model to accommodate Spanish as a new language.
arXiv Detail & Related papers (2023-12-12T13:26:42Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable
Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Radiology-GPT: A Large Language Model for Radiology [74.07944784968372]
We introduce Radiology-GPT, a large language model for radiology.
It demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA.
It exhibits significant versatility in radiological diagnosis, research, and communication.
arXiv Detail & Related papers (2023-06-14T17:57:24Z) - Language Models are Few-shot Learners for Prognostic Prediction [0.4254099382808599]
We explore the use of transformers and language models in prognostic prediction for immunotherapy using real-world patients' clinical data and molecular profiles.
The study benchmarks the efficacy of baselines and language models on prognostic prediction across multiple cancer types and investigates the impact of different pretrained language models under few-shot regimes.
arXiv Detail & Related papers (2023-02-24T15:35:36Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, Prototypical Network, a simple yet effective meta-learning method for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.